CSpace
Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering
Peng, Min1,2; Shao, Xiaohu3; Shi, Yu1; Zhou, Xiangdong1
2024-04-01
Abstract: Video question answering (VideoQA) is challenging because it requires reasoning about natural language and multimodal interactive relations. Most existing methods apply attention mechanisms to extract interactions between the question and the video or to extract effective spatio-temporal relational representations. However, these methods neglect what the relations between intra- and inter-modal interactions imply for multimodal learning, and they fail to fully exploit the synergistic effect of multiscale semantics in answer reasoning. In this article, we propose a novel hierarchical synergy-enhanced multimodal relational network (HMRNet) to address these issues. Specifically, we devise (i) a compact and unified relation-oriented interaction module that explores the relation between intra- and inter-modal interactions to enable effective multimodal learning; and (ii) a hierarchical synergistic memory unit that leverages a memory-based interaction scheme to complement and fuse multimodal semantics at multiple scales to achieve synergistic enhancement of answer reasoning. With careful design of each component, our HMRNet has fewer parameters and is computationally efficient. Extensive experiments and qualitative analyses demonstrate that HMRNet is superior to previous state-of-the-art methods on eight benchmark datasets. We also demonstrate the effectiveness of the different components of our method.
Keywords: Video question answering; multimodal learning; attention mechanisms; multiscale semantics
DOI: 10.1145/3630101
Journal: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS
ISSN: 1551-6857
Volume: 20; Issue: 4; Pages: 22
Corresponding author: Peng, Min (pengmin@cigit.ac.cn)
Indexed in: SCI
WOS accession number: WOS:001153382100001
Language: en