KMS Chongqing Institute of Green and Intelligent Technology, CAS
Title | Hierarchical Spatial-Temporal Adaptive Graph Fusion for Monocular 3D Human Pose Estimation
Authors | Zhang, Lijun1,2; Lu, Feng3,4; Zhou, Kangkang1,2; Zhou, Xiang-Dong1,2; Shi, Yu1,2
Year | 2024
Abstract | Single-view 3D human pose estimation (HPE) based on Graph Convolutional Networks currently suffers from problems such as insufficient feature representation and depth ambiguity. To address these issues, this letter proposes a hierarchical spatial-temporal adaptive graph fusion framework to improve monocular 3D HPE performance. First, to enhance the spatial semantic feature representation of human nodes, a progressive adaptive graph feature capture strategy is developed, which adaptively constructs global-to-local attention graph features of all human joints in a coarse-to-fine manner. A spatial-temporal attention fusion module is then constructed to model long-term sequential dependencies and mitigate depth ambiguity. The temporal attention factors of related frames are captured and used as intermediate supervision of joint depth. The spatial semantic information among all joints in the same frame and the temporal contextual knowledge of joints across relevant frames are fused to build spatial-temporal correlations and optimize the final features. Extensive experiments on two popular benchmarks show that our method outperforms several state-of-the-art approaches and improves 3D HPE performance.
Keywords | 3D human pose estimation; attention mechanism; graph convolutional network; spatial-temporal fusion
DOI | 10.1109/LSP.2023.3339060 |
Journal | IEEE SIGNAL PROCESSING LETTERS
ISSN | 1070-9908 |
Volume | 31
Pages | 61-65
Corresponding Author | Zhou, Xiang-Dong(zhouxiangdong@cigit.ac.cn)
Indexed By | SCI
WOS ID | WOS:001138710200016
Language | English