Hierarchical Spatial-Temporal Adaptive Graph Fusion for Monocular 3D Human Pose Estimation
Zhang, Lijun1,2; Lu, Feng3,4; Zhou, Kangkang1,2; Zhou, Xiang-Dong1,2; Shi, Yu1,2
Year: 2024
Abstract: Single-view 3D human pose estimation (HPE) based on graph convolutional networks currently suffers from insufficient feature representation and depth ambiguity. To address these issues, this letter proposes a hierarchical spatial-temporal adaptive graph fusion framework to improve monocular 3D HPE performance. First, to enhance the spatial semantic feature representation of human joints, a progressive adaptive graph feature capture strategy is developed, which adaptively constructs global-to-local attention graph features over all human joints in a coarse-to-fine manner. A spatial-temporal attention fusion module is then constructed to model long-term sequential dependencies and mitigate depth ambiguity. The temporal attention factors of related frames are captured and used to provide intermediate supervision of joint depth. The spatial semantic information among all joints in the same frame and the temporal contextual knowledge of joints across relevant frames are fused to build spatial-temporal correlations and refine the final features. Extensive experiments on two popular benchmarks show that our method outperforms several state-of-the-art approaches and improves 3D HPE performance.
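To make the "adaptive graph" idea concrete, the following is a minimal sketch (not the authors' implementation) of one common formulation: a fixed skeleton adjacency is blended with a data-driven attention matrix so that joints beyond the kinematic tree can interact. All names, the blending weight `alpha`, and the random projections are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_graph_attention(x, w_q, w_k, w_v, adj, alpha=0.5):
    """Aggregate per-joint features over a blended graph.

    x:   (J, d) per-joint features for one frame
    adj: (J, J) row-normalized skeleton adjacency (fixed prior)
    The learned attention term is an assumption standing in for the
    paper's adaptive graph construction; it lets distant joints
    exchange information beyond the fixed skeleton edges.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (J, J) adaptive graph
    graph = alpha * adj + (1.0 - alpha) * attn       # fused adjacency
    return graph @ v                                 # aggregated joint features

# Toy usage: 17 joints (the Human3.6M convention), 8-dim features.
rng = np.random.default_rng(0)
J, d = 17, 8
x = rng.normal(size=(J, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
adj = np.eye(J)  # placeholder for a real skeleton adjacency
out = adaptive_graph_attention(x, w_q, w_k, w_v, adj)
print(out.shape)  # (17, 8)
```

In a coarse-to-fine setting such as the one the abstract describes, a layer like this could be stacked at several graph granularities, with each level refining the joint features of the previous one.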
Keywords: 3D human pose estimation; attention mechanism; graph convolutional network; spatial-temporal fusion
DOI: 10.1109/LSP.2023.3339060
Journal: IEEE SIGNAL PROCESSING LETTERS
ISSN: 1070-9908
Volume: 31, Pages: 61-65
Corresponding author: Zhou, Xiang-Dong (zhouxiangdong@cigit.ac.cn)
Indexed in: SCI
WOS accession number: WOS:001138710200016
Language: English