A fused convolutional spatio-temporal progressive approach for 3D human pose estimation

  • Original article
  • Published in The Visual Computer

Abstract

The cascaded self-attention modules of vision transformers are effective at learning global correlations between joints for 3D human pose estimation, but capturing the local dynamics of body joints remains challenging. To this end, we present a fused convolutional spatio-temporal progressive approach that combines convolutional operations with self-attention mechanisms to obtain rich representations of human joints. First, we adopt a Spatial Convolutional Transformer Block to fuse local and global spatial representations across body joints within each frame. Then, we employ a Temporal Convolutional Transformer Block to aggregate the temporal motion information of the same body joint across frames in a hierarchical, progressive fashion. Finally, we design a novel full sequence loss (FLoss) that enforces temporal consistency over the entire output sequence, yielding smoother and more reliable 3D poses. Experimental results on the Human3.6M, HumanEva-I, and MPI-INF-3DHP datasets demonstrate that the proposed approach achieves favorable improvements and promising generalization ability compared with several state-of-the-art methods.
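
The abstract describes the approach only at a high level. As an informal illustration of the two ideas it names (fusing a convolutional branch with a self-attention branch inside a single transformer block, and supervising the whole output sequence for temporal smoothness), the following minimal PyTorch sketch may help. Every name here (ConvAttentionBlock, full_sequence_loss), the tensor layout, the additive fusion rule, and the velocity-difference smoothness term are assumptions made for illustration; this is not the authors' implementation.

import torch
import torch.nn as nn


class ConvAttentionBlock(nn.Module):
    """Illustrative block that adds a local (depth-wise 1D convolution) branch
    to a global multi-head self-attention branch. Tokens may index joints
    (spatial block) or frames (temporal block); the concrete design is assumed."""

    def __init__(self, dim: int, num_heads: int = 8, kernel_size: int = 3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depth-wise convolution captures correlations between neighbouring tokens.
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        h = self.norm1(x)
        global_feat, _ = self.attn(h, h, h)                         # global correlations
        local_feat = self.conv(h.transpose(1, 2)).transpose(1, 2)   # local dynamics
        x = x + global_feat + local_feat                            # additive fusion (assumed)
        return x + self.mlp(self.norm2(x))


def full_sequence_loss(pred: torch.Tensor, gt: torch.Tensor,
                       smooth_weight: float = 0.1) -> torch.Tensor:
    """Illustrative stand-in for the paper's FLoss: per-frame joint error plus a
    penalty on frame-to-frame velocity differences, encouraging temporally
    smooth 3D poses. pred, gt: (batch, frames, joints, 3)."""
    pos_err = torch.norm(pred - gt, dim=-1).mean()
    pred_vel = pred[:, 1:] - pred[:, :-1]
    gt_vel = gt[:, 1:] - gt[:, :-1]
    smooth_err = torch.norm(pred_vel - gt_vel, dim=-1).mean()
    return pos_err + smooth_weight * smooth_err


# Example usage (hypothetical dimensions): 17 joints per frame, 64-dim tokens.
# x = torch.randn(2, 17, 64); y = ConvAttentionBlock(64)(x)   # y: (2, 17, 64)

As stated in the abstract, the spatial block operates over the body joints within each frame and the temporal block over the frames of the same joint; the sketch is agnostic to which axis the tokens index, so the same structure could be instantiated once per role.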

Data availability

The Human3.6M, HumanEva-I, and MPI-INF-3DHP datasets were used in this study. Data will be made available upon reasonable request to the corresponding author.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61771420 and 62001413, the Natural Science Foundation of Hebei Province under Grant F2020203064, and the Science and Technology Project of Hebei Education Department under Grant BJK2023117.

Author information

Corresponding author

Correspondence to Zhengping Hu.

Ethics declarations

Conflict of interest

The authors declare that they have no commercial or associative interests that represent a conflict of interest in connection with the submitted work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, H., Hu, Z., Sun, Z. et al. A fused convolutional spatio-temporal progressive approach for 3D human pose estimation. Vis Comput 40, 4387–4399 (2024). https://doi.org/10.1007/s00371-023-03088-2
