Journal of Guangxi Normal University (Natural Science Edition) ›› 2025, Vol. 43 ›› Issue (5): 130-144. DOI: 10.16088/j.issn.1001-6600.2024122505

• Intelligent Information Processing •

  • Corresponding author: YI Jianbing (born 1980), male, from Yichun, Jiangxi, China; associate professor, Ph.D., Jiangxi University of Science and Technology. E-mail: yijianbing8@jxust.edu.cn
  • Funding: National Natural Science Foundation of China (62066018, 62366017); Natural Science Foundation of Jiangxi Province (20181BAB202004); Jiangxi Provincial Graduate Innovation Special Fund (YC2023-S662)

Design of 3D Human Pose Estimation Network Based on Spatio-Temporal Attention

YI Jianbing1,2*, ZHANG Yuxian1,2, CAO Feng1,2, LI Jun1,2, PENG Xin1,2, CHEN Xin1,2   

  1. College of Information Engineering, Jiangxi University of Science and Technology, Ganzhou Jiangxi 341000, China;
    2. Jiangxi Provincial Key Laboratory of Multidimensional Intelligent Perception and Control (Jiangxi University of Science and Technology), Ganzhou Jiangxi 341000, China
  • Received:2024-12-25 Revised:2025-03-14 Online:2025-09-05 Published:2025-08-05


Abstract: In 3D human pose estimation, occlusion leads to inaccurate extraction of human joint points. To address this problem, this paper proposes a 3D human pose estimation algorithm that combines spatio-temporal attention and channel attention. First, a feature filtering module is proposed, which captures additional feature information of human joint points by introducing a position embedding module. Second, a mobile vision Transformer temporal attention module is proposed, which obtains more pose-feature detail by introducing the SiLU activation function. Finally, a channel attention module is proposed, which adjusts the weights of the output channel features by introducing a parallel branch processing architecture and adding normalization layers, so that the algorithm focuses on human pose features while suppressing background features. Experiments are conducted on the Human3.6M dataset. Compared with the baseline model Strided Transformer, the mean per joint position error (MPJPE) and the Procrustes-aligned mean per joint position error (P-MPJPE) decrease by 2.5% and 2.3%, respectively, when 2D joint points extracted by the cascaded pyramid network (CPN) are used as input. The MPJPE decreases by 6.7% when the annotated 2D joint points of the Human3.6M dataset are used as input. The experimental results show that the proposed algorithm achieves high accuracy.
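The two error metrics reported above are standard in 3D pose work. For concreteness, a minimal NumPy sketch of MPJPE and P-MPJPE follows; this is not the authors' evaluation code, and the (J, 3) joint-array layout is an assumption:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance
    between predicted and ground-truth joints, each of shape (J, 3)."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def p_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: remove the optimal similarity
    transform (translation, rotation, scale) before measuring error."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    X, Y = pred - mu_p, gt - mu_g              # center both poses
    nx, ny = np.linalg.norm(X), np.linalg.norm(Y)
    X, Y = X / nx, Y / ny                      # normalize overall scale
    U, S, Vt = np.linalg.svd(X.T @ Y)          # orthogonal Procrustes
    R = U @ Vt
    if np.linalg.det(R) < 0:                   # disallow reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = U @ Vt
    s = S.sum()                                # optimal scale factor
    aligned = ny * s * (X @ R) + mu_g          # map pred into gt's frame
    return mpjpe(aligned, gt)
```

Because P-MPJPE factors out global translation, rotation, and scale, it isolates errors in the internal structure of the pose, which is why it is usually lower than MPJPE.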

Key words: 3D human pose estimation, occlusion, spatio-temporal attention, channel attention, Transformer
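The SiLU activation named in the abstract is defined as silu(x) = x·σ(x). A one-line NumPy sketch makes the design choice concrete: unlike ReLU, it passes a small, smooth signal for negative inputs instead of zeroing them, which is the "more pose-feature detail" behavior the abstract alludes to:

```python
import numpy as np

def silu(x):
    # SiLU (swish-1): x * sigmoid(x) = x / (1 + exp(-x));
    # smooth and non-monotonic, with a minimum near x ≈ -1.28
    return x / (1.0 + np.exp(-x))
```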

CLC number: TP391.41
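The channel attention module described in the abstract reweights output channels; its parallel-branch layout and extra normalization layers are specific to the paper, but the underlying gating idea can be sketched in a generic squeeze-and-excitation style. The sketch below is an illustration only, not the paper's module, and the (C, T) channel-by-time layout is an assumption:

```python
import numpy as np

def channel_gate(x):
    """Generic channel attention gate: pool each channel over time,
    squash the pooled value to (0, 1) with a sigmoid, and rescale
    the channel, so informative channels pass through more strongly.
    x: (C, T) array of per-channel feature sequences."""
    pooled = x.mean(axis=1)                   # squeeze: (C,)
    gate = 1.0 / (1.0 + np.exp(-pooled))      # excitation: (C,) in (0, 1)
    return x * gate[:, None]                  # reweight each channel
```

A learned version would insert small fully connected layers between the pooling and the sigmoid; the parameter-free form above only shows the gating mechanism itself.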

[1] 刘琼, 何建航, 温嘉校. 联合静动态关节关系的3D人体姿态估计[J]. 北京邮电大学学报, 2024, 47(5): 35-43. DOI: 10.13190/j.jbupt.2023-150.
[2] 李佳宁, 王东凯, 张史梁. 基于深度学习的二维人体姿态估计: 现状及展望[J]. 计算机学报, 2024, 47(1): 231-250. DOI: 10.11897/SP.J.1016.2024.00231.
[3] QIU Z W, QIU K, FU J L, et al. Weakly-supervised pre-training for 3D human pose estimation via perspective knowledge[J]. Pattern Recognition, 2023, 139: 109497. DOI: 10.1016/j.patcog.2023.109497.
[4] 朱妍, 汪楷, 汪粼波, 等. 联合注意力和条件GAN的被遮挡人体姿态和体形估计方法[J]. 计算机辅助设计与图形学学报, 2024, 36(1): 142-151. DOI: 10.3724/SP.J.1089.2024.19863.
[5] FANG Q, XU Z H, HU M X, et al. SPGformer: serial-parallel hybrid GCN-transformer with graph-oriented encoder for 2-D-to-3-D human pose estimation[J]. IEEE Transactions on Instrumentation and Measurement, 2024, 73: 8003015. DOI: 10.1109/TIM.2024.3381701.
[6] 万云翀, 宋云鹏, 刘利刚. 基于体素联合坐标的单人三维姿态估计[J]. 计算机辅助设计与图形学学报, 2022, 34(9): 1411-1419. DOI: 10.3724/SP.J.1089.2022.19167.
[7] LI W H, LIU H, DING R W, et al. Exploiting temporal contexts with strided transformer for 3D human pose estimation[J]. IEEE Transactions on Multimedia, 2023, 25: 1282-1293. DOI: 10.1109/TMM.2022.3141231.
[8] ZHENG C, ZHU S J, MENDIETA M, et al. 3D human pose estimation with spatial and temporal transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society, 2021: 11636-11645. DOI: 10.1109/ICCV48922.2021.01145.
[9] LI W H, LIU H, TANG H, et al. MHFormer: multi-hypothesis transformer for 3D human pose estimation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2022: 13137-13146. DOI: 10.1109/CVPR52688.2022.01280.
[10] SHAN W K, LIU Z H, ZHANG X F, et al. P-STMO: pre-trained spatial temporal many-to-one model for 3D human pose estimation[C]// Computer Vision - ECCV 2022 (LNCS Volume 13665). Cham: Springer, 2022: 461-478. DOI: 10.1007/978-3-031-20065-6_27.
[11] CHENG C, XU H H. Human pose estimation in complex background videos via Transformer-based multi-scale feature integration[J]. Displays, 2024, 84: 102805. DOI: 10.1016/j.displa.2024.102805.
[12] 何建航, 孙郡瑤, 刘琼. 基于人体和场景上下文的多人3D姿态估计[J]. 软件学报, 2024, 35(4): 2039-2054. DOI: 10.13328/j.cnki.jos.006837.
[13] 唐福梅, 聂勇伟, 余嘉祺, 等. 伪时空图卷积网络修复姿态引导的Transformer行人视频修复方法[J]. 计算机辅助设计与图形学学报, 2024, 36(4): 552-564. DOI: 10.3724/SP.J.1089.2024.19773.
[14] 叶俊, 张云. 基于时空多特征融合网络的三维人体姿态估计[J]. 光电子·激光, 2022, 33(12): 1306-1314. DOI: 10.16136/j.joel.2022.12.0101.
[15] 杨韫韬, 聂勇伟, 张青, 等. 基于RNN和注意力机制的双向人体姿态补全方法[J]. 计算机辅助设计与图形学学报, 2022, 34(11): 1772-1783. DOI: 10.3724/SP.J.1089.2022.19196.
[16] YANG X C, LAN Z P, WANG N, et al. LiteFer: an approach based on MobileViT expression recognition[J]. Sensors, 2024, 24(18): 5868. DOI: 10.3390/s24185868.
[17] 张淑芳, 赖双意, 刘嫣然. 基于多监督的三维人体姿势与形状预测[J]. 天津大学学报(自然科学与工程技术版), 2024, 57(2): 147-154. DOI: 10.11784/tdxbz202211011.
[18] 马金林, 崔琦磊, 马自萍, 等. 预加权调制密集图卷积网络三维人体姿态估计[J]. 计算机科学与探索, 2024, 18(4): 963-977. DOI: 10.3778/j.issn.1673-9418.2302065.
[19] 卫娜, 焦明海. 融合双通道关节约束的三维人体姿态估计[J]. 计算机工程与应用, 2024, 60(23): 146-154. DOI: 10.3778/j.issn.1002-8331.2308-0087.
[20] FAN L L, JIANG K L, ZHOU W X, et al. 3D human pose estimation from video via multi-scale multi-level spatial temporal features[J]. Multimedia Tools and Applications, 2024, 83(29): 73533-73552. DOI: 10.1007/s11042-023-17955-6.
[21] EINFALT M, LUDWIG K, LIENHART R. Uplift and upsample: efficient 3D human pose estimation with uplifting transformers[C]// 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Los Alamitos, CA: IEEE Computer Society, 2023: 2902-2912. DOI: 10.1109/WACV56688.2023.00292.
[22] SHAN W K, LU H P, WANG S S, et al. Improving robustness and accuracy via relative information encoding in 3D human pose estimation[C]// MM'21: Proceedings of the 29th ACM International Conference on Multimedia. New York, NY: Association for Computing Machinery, 2021: 3446-3454. DOI: 10.1145/3474085.3475504.
[23] TANG Z H, HAO Y B, LI J, et al. FTCM: frequency-temporal collaborative module for efficient 3D human pose estimation in video[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(2): 911-923. DOI: 10.1109/TCSVT.2023.3286402.
[24] LIU R X, SHEN J, WANG H, et al. Attention mechanism exploits temporal contexts: real-time 3D human pose reconstruction[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2020: 5063-5072. DOI: 10.1109/CVPR42600.2020.00511.
[25] ISLAM Z, BEN HAMZA A. Multi-hop graph transformer network for 3D human pose estimation[J]. Journal of Visual Communication and Image Representation, 2024, 101: 104174. DOI: 10.1016/j.jvcir.2024.104174.
[26] 黄程远, 宋晓宁, 冯振华. ARGP-Pose: 基于关键点间关系分析与分组预测的3D人体姿态估计[J]. 计算机应用研究, 2022, 39(7): 2178-2182, 2202. DOI: 10.19734/j.issn.1001-3695.2021.11.0618.
[27] ZENG A L, SUN X, HUANG F Y, et al. SRNet: improving generalization in 3D human pose estimation with a split-and-recombine approach[C]// Computer Vision - ECCV 2020 (LNCS Volume 12359). Cham: Springer, 2020: 507-523. DOI: 10.1007/978-3-030-58568-6_30.
[28] CHEN T L, FANG C, SHEN X H, et al. Anatomy-aware 3D human pose estimation with bone-based pose decomposition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(1): 198-209. DOI: 10.1109/TCSVT.2021.3057267.
[29] CHEN S, XU Y X, ZOU B J. Prior-knowledge-based self-attention network for 3D human pose estimation[J]. Expert Systems with Applications, 2023, 225: 120213. DOI: 10.1016/j.eswa.2023.120213.
[30] JIA R, YANG H H, ZHAO L, et al. MPA-GNet: multi-scale parallel adaptive graph network for 3D human pose estimation[J]. The Visual Computer, 2024, 40(8): 5883-5899. DOI: 10.1007/s00371-023-03142-z.
[31] WANG Y, KANG H B, WU D D, et al. Global and local spatio-temporal encoder for 3D human pose estimation[J]. IEEE Transactions on Multimedia, 2023, 26: 4039-4049. DOI: 10.1109/TMM.2023.3321438.
[32] 薛峰, 边福利, 李书杰. 面向三维人体坐标及旋转角估计的注意力融合网络[J]. 中国图象图形学报, 2024, 29(10): 3116-3129. DOI: 10.11834/jig.230502.
[33] PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2019: 7745-7754. DOI: 10.1109/CVPR.2019.00794.