KMS Chongqing Institute of Green and Intelligent Technology, CAS
Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering | |
Peng, Min1,2; Shao, Xiaohu3; Shi, Yu1; Zhou, Xiangdong1 | |
2024-04-01 | |
Abstract | Video question answering (VideoQA) is challenging as it requires reasoning about natural language and multi-modal interactive relations. Most existing methods apply attention mechanisms to extract interactions between the question and the video or to extract effective spatio-temporal relational representations. However, these methods neglect the implication of relations between intra- and inter-modal interactions for multimodal learning, and they fail to fully exploit the synergistic effect of multiscale semantics in answer reasoning. In this article, we propose a novel hierarchical synergy-enhanced multimodal relational network (HMRNet) to address these issues. Specifically, we devise (i) a compact and unified relation-oriented interaction module that explores the relation between intra- and inter-modal interactions to enable effective multimodal learning; and (ii) a hierarchical synergistic memory unit that leverages a memory-based interaction scheme to complement and fuse multimodal semantics at multiple scales to achieve synergistic enhancement of answer reasoning. With careful design of each component, our HMRNet has fewer parameters and is computationally efficient. Extensive experiments and qualitative analyses demonstrate that the HMRNet is superior to previous state-of-the-art methods on eight benchmark datasets. We also demonstrate the effectiveness of the different components of our method. |
Keywords | Video question answering; multimodal learning; attention mechanisms; multiscale semantics |
DOI | 10.1145/3630101 |
Journal | ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS |
ISSN | 1551-6857 |
Volume | 20; Issue: 4; Pages: 22 |
Corresponding author | Peng, Min(pengmin@cigit.ac.cn) |
Indexed in | SCI |
WOS accession number | WOS:001153382100001 |
Language | en |