Journal of Guangxi Normal University (Natural Science Edition), 2025, Vol. 43, Issue (5): 130-144. doi: 10.16088/j.issn.1001-6600.2024122505

• Intelligence Information Processing •

Design of 3D Human Pose Estimation Network Based on Spatio-Temporal Attention

YI Jianbing1,2*, ZHANG Yuxian1,2, CAO Feng1,2, LI Jun1,2, PENG Xin1,2, CHEN Xin1,2   

  1. College of Information Engineering, Jiangxi University of Science and Technology, Ganzhou Jiangxi 341000, China;
  2. Jiangxi Provincial Key Laboratory of Multidimensional Intelligent Perception and Control (Jiangxi University of Science and Technology), Ganzhou Jiangxi 341000, China
  • Received: 2024-12-25  Revised: 2025-03-14  Online: 2025-09-05  Published: 2025-08-05

Abstract: In 3D human pose estimation, occlusion leads to inaccurate extraction of human joint points. To address this problem, this paper proposes a 3D human pose estimation algorithm that combines spatio-temporal attention with channel attention. First, a feature filtering module is proposed, which introduces a position embedding module to further capture the feature information of human joint points. Second, a mobile vision transformer temporal attention module is proposed, which introduces the SiLU activation function to obtain more details of pose features. Finally, a channel attention module is proposed, which adjusts the weights of the output channel features through a parallel-branch processing architecture with added normalization layers, so that the algorithm focuses on human pose features while reducing the influence of background features. Experiments are conducted on the Human3.6M dataset. Compared with the baseline model Strided Transformer, the mean per joint position error (MPJPE) and the Procrustes-aligned mean per joint position error (P-MPJPE) decrease by 2.5% and 2.3%, respectively, when 2D joint points extracted by the cascaded pyramid network (CPN) are used as input, and the MPJPE decreases by 6.7% when the annotated 2D joint points of the Human3.6M dataset are used as input. These results show that the proposed algorithm achieves higher accuracy than the baseline.

Key words: 3D human pose estimation, occlusion, spatio-temporal attention, channel attention, Transformer

CLC Number:  TP391.41
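
The abstract reports results in terms of MPJPE and P-MPJPE. As a minimal sketch of how these two metrics are commonly computed for a single pose, the NumPy code below may help; it is not the authors' evaluation code. The function names `mpjpe` and `p_mpjpe`, the `(J, 3)` array layout, and the 17-joint toy pose are assumptions made for illustration.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance over joints.

    pred, gt: arrays of shape (J, 3), assumed to be in the same units.
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))


def p_mpjpe(pred, gt):
    """MPJPE after a similarity (Procrustes) alignment of pred onto gt.

    The alignment solves for the scale, rotation, and translation that
    minimise the squared distance between the two joint sets.
    """
    mu_p = pred.mean(axis=0, keepdims=True)
    mu_g = gt.mean(axis=0, keepdims=True)
    p0 = pred - mu_p                      # centred prediction
    g0 = gt - mu_g                        # centred ground truth

    # The optimal rotation comes from the SVD of the 3x3 cross-covariance.
    u, s, vt = np.linalg.svd(p0.T @ g0)
    if np.linalg.det(u @ vt) < 0:         # avoid a reflection
        u[:, -1] *= -1.0
        s[-1] *= -1.0
    rot = u @ vt

    scale = s.sum() / (p0 ** 2).sum()     # least-squares optimal scale
    aligned = scale * (p0 @ rot) + mu_g
    return mpjpe(aligned, gt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(17, 3))         # hypothetical 17-joint pose
    angle = 0.3
    rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
    # A prediction that differs from gt only by scale, rotation, and shift.
    pred = 2.0 * gt @ rz + np.array([1.0, 2.0, 3.0])
    print("MPJPE:", mpjpe(pred, gt))
    print("P-MPJPE:", p_mpjpe(pred, gt))
```

In this toy case the prediction differs from the ground truth only by a similarity transform, so P-MPJPE is close to zero while MPJPE is not. This illustrates why the two metrics are reported separately: P-MPJPE isolates errors in pose shape from errors in global scale, orientation, and position.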