A fused convolutional spatio-temporal progressive approach for 3D human pose estimation

  • Original article
  • Published in The Visual Computer

Abstract

The cascaded self-attention modules of vision transformers are effective at learning global correlations between joints for 3D human pose estimation, but capturing the local dynamics of body joints remains challenging. To this end, we present a fused convolutional spatio-temporal progressive approach that combines convolutional operations with self-attention mechanisms to obtain rich representations of human joints. First, we adopt a Spatial Convolutional Transformer Block to fuse local and global spatial representations across body joints within each frame. Then, we employ a Temporal Convolutional Transformer Block to aggregate the temporal motion information of the same body joint across frames in a hierarchical, progressive fashion. Finally, we design a novel full sequence loss (FLoss) that enforces temporal consistency over the entire output sequence, yielding smoother and more reliable 3D poses. Experimental results on the Human3.6M, HumanEva-I, and MPI-INF-3DHP datasets demonstrate that the proposed approach achieves favorable improvements and promising generalization ability compared with several state-of-the-art methods.
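
The abstract describes the approach only at a high level. As an informal illustration of the two ideas it names (fusing a convolutional branch with a self-attention branch inside a single transformer block, and supervising the whole output sequence for temporal smoothness), the following minimal PyTorch sketch may help. Every name here (ConvAttentionBlock, full_sequence_loss), the tensor layout, the additive fusion rule, and the velocity-difference smoothness term are assumptions made for illustration; this is not the authors' implementation.

import torch
import torch.nn as nn


class ConvAttentionBlock(nn.Module):
    """Illustrative block that adds a local (depth-wise 1D convolution) branch
    to a global multi-head self-attention branch. Tokens may index joints
    (spatial block) or frames (temporal block); the concrete design is assumed."""

    def __init__(self, dim: int, num_heads: int = 8, kernel_size: int = 3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depth-wise convolution captures correlations between neighbouring tokens.
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        h = self.norm1(x)
        global_feat, _ = self.attn(h, h, h)                         # global correlations
        local_feat = self.conv(h.transpose(1, 2)).transpose(1, 2)   # local dynamics
        x = x + global_feat + local_feat                            # additive fusion (assumed)
        return x + self.mlp(self.norm2(x))


def full_sequence_loss(pred: torch.Tensor, gt: torch.Tensor,
                       smooth_weight: float = 0.1) -> torch.Tensor:
    """Illustrative stand-in for the paper's FLoss: per-frame joint error plus a
    penalty on frame-to-frame velocity differences, encouraging temporally
    smooth 3D poses. pred, gt: (batch, frames, joints, 3)."""
    pos_err = torch.norm(pred - gt, dim=-1).mean()
    pred_vel = pred[:, 1:] - pred[:, :-1]
    gt_vel = gt[:, 1:] - gt[:, :-1]
    smooth_err = torch.norm(pred_vel - gt_vel, dim=-1).mean()
    return pos_err + smooth_weight * smooth_err


# Example usage (hypothetical dimensions): 17 joints per frame, 64-dim tokens.
# x = torch.randn(2, 17, 64); y = ConvAttentionBlock(64)(x)   # y: (2, 17, 64)

As stated in the abstract, the spatial block operates over the body joints within each frame and the temporal block over the frames of the same joint; the sketch is agnostic to which axis the tokens index, so the same structure could be instantiated once per role.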

Data availability

The Human3.6M, HumanEva-I, and MPI-INF-3DHP datasets were used in this study. Data will be made available upon reasonable request to the corresponding author.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61771420 and 62001413, the Natural Science Foundation of Hebei Province under Grant F2020203064, and the Science and Technology Project of Hebei Education Department under Grant BJK2023117.

Author information

Corresponding author

Correspondence to Zhengping Hu.

Ethics declarations

Conflict of interest

The authors declare that they have no commercial or associative interests that represent a conflict of interest in connection with the submitted work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, H., Hu, Z., Sun, Z. et al. A fused convolutional spatio-temporal progressive approach for 3D human pose estimation. Vis Comput 40, 4387–4399 (2024). https://doi.org/10.1007/s00371-023-03088-2
