Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering

doi:10.1145/3630101


	Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering
	Peng, Min1,2 ; Shao, Xiaohu3 ; Shi, Yu1 ; Zhou, Xiangdong1
	2024-04-01
Abstract	Video question answering (VideoQA) is challenging as it requires reasoning about natural language and multi-modal interactive relations. Most existing methods apply attention mechanisms to extract interactions between the question and the video or to extract effective spatio-temporal relational representations. However, these methods neglect the implication of relations between intra- and inter-modal interactions for multimodal learning, and they fail to fully exploit the synergistic effect of multiscale semantics in answer reasoning. In this article, we propose a novel hierarchical synergy-enhanced multimodal relational network (HMRNet) to address these issues. Specifically, we devise (i) a compact and unified relation-oriented interaction module that explores the relation between intra- and inter-modal interactions to enable effective multimodal learning; and (ii) a hierarchical synergistic memory unit that leverages a memory-based interaction scheme to complement and fuse multimodal semantics at multiple scales to achieve synergistic enhancement of answer reasoning. With careful design of each component, our HMRNet has fewer parameters and is computationally efficient. Extensive experiments and qualitative analyses demonstrate that the HMRNet is superior to previous state-of-the-art methods on eight benchmark datasets. We also demonstrate the effectiveness of the different components of our method.
Keywords	Video question answering; multimodal learning; attention mechanisms; multiscale semantics
DOI	10.1145/3630101
Journal	ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS
ISSN	1551-6857
Volume	20 Issue: 4 Pages: 22
Corresponding author	Peng, Min(pengmin@cigit.ac.cn)
Indexed by	SCI
WOS accession number	WOS:001153382100001
Language	en

Institutional Repository of Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences