Design of 3D Human Pose Estimation Network Based on Spatio-Temporal Attention

doi:10.16088/j.issn.1001-6600.2024122505

Abstract

Abstract: In 3D human pose estimation, occlusion leads to inaccurate extraction of human joint points. To address this problem, this paper proposes a 3D human pose estimation algorithm that combines spatio-temporal attention and channel attention. First, a feature filtering module is proposed, which introduces a position embedding module to further capture the feature information of human joint points. Second, a mobile vision Transformer temporal attention module is proposed, which introduces the SiLU activation function to capture more pose feature details. Finally, a channel attention module is proposed, which introduces a parallel branch processing architecture and additional normalization layers to adjust the feature weights of the output channels, so that the algorithm focuses on human pose features and suppresses background features. Experiments are conducted on the Human3.6M dataset. Compared with the baseline model Strided Transformer, when the 2D joint points extracted by the cascaded pyramid network (CPN) are used as input, the mean per joint position error (MPJPE) and the Procrustes-aligned mean per joint position error (P-MPJPE) decrease by 2.5% and 2.3%, respectively; when the 2D joint points annotated in the Human3.6M dataset are used as input, the MPJPE decreases by 6.7%. The experimental results show that the proposed algorithm achieves high accuracy.

Key words: 3D human pose estimation, occlusion, spatio-temporal attention, channel attention, Transformer
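The two metrics reported in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' evaluation code: the function names, the 17-joint layout, and the use of a per-pose similarity alignment (rotation, uniform scale, translation) for P-MPJPE are assumptions based on how these metrics are commonly defined.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance per joint."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))


def procrustes_align(pred, gt):
    """Align pred (J, 3) to gt (J, 3) with the least-squares similarity transform."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p0, g0 = pred - mu_p, gt - mu_g            # remove translation
    u, s, vt = np.linalg.svd(p0.T @ g0)        # SVD of the 3x3 cross-covariance
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    flip = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(flip) @ u.T           # optimal rotation
    scale = float((s * flip).sum() / (p0 ** 2).sum())  # optimal uniform scale
    return scale * p0 @ rot.T + mu_g


def p_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt_pose = rng.normal(size=(17, 3))         # hypothetical 17-joint pose
    noisy = gt_pose + 0.01 * rng.normal(size=(17, 3))
    print("MPJPE:", mpjpe(noisy, gt_pose))
    print("P-MPJPE:", p_mpjpe(noisy, gt_pose))
```

Because P-MPJPE removes global rotation, scale, and translation before measuring the error, it isolates errors in the pose itself, whereas MPJPE also penalizes errors in global placement.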

CLC number: TP391.41

YI Jianbing, ZHANG Yuxian, CAO Feng, LI Jun, PENG Xin, CHEN Xin. Design of 3D Human Pose Estimation Network Based on Spatio-Temporal Attention[J]. Journal of Guangxi Normal University (Natural Science Edition), 2025, 43(5): 130-144.
