Self-supervised method for 3D human pose estimation with consistent shape and viewpoint factorization

Applied Intelligence

Abstract

3D human pose estimation from monocular images has achieved great success thanks to sophisticated deep network architectures and large 3D human pose datasets. However, it remains an open problem when such datasets are unavailable, because estimating 3D human poses from monocular images is an ill-posed inverse problem. We propose a novel self-supervised method that effectively trains a 3D human pose estimation network without any extra 3D pose annotations. Unlike the commonly used GAN-based techniques, our method overcomes the projection ambiguity by fully disentangling the camera viewpoint from the 3D human shape. Specifically, we design a factorization network that predicts the coefficients of the canonical 3D human pose and the camera viewpoint in two separate channels, where the canonical 3D human pose is represented as a combination of pose bases from a dictionary. To guarantee consistent factorization, we design a simple yet effective loss function that takes advantage of multi-view information. In addition, to generate a robust canonical reconstruction from the 3D pose coefficients, we exploit the underlying 3D geometry of human poses to learn a novel hierarchical dictionary from 2D poses. The hierarchical dictionary has stronger 3D pose expressibility than the traditional single-level dictionary. We comprehensively evaluate the proposed method on two public 3D human pose datasets, Human3.6M and MPI-INF-3DHP. The experimental results show that our method disentangles 3D human shapes from camera viewpoints and reconstructs 3D human poses accurately. Moreover, it achieves state-of-the-art results compared with recent weakly supervised and self-supervised methods.
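The factorization idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the array sizes, the axis-angle camera parameterization, the orthographic projection, and the squared-difference consistency term are all assumptions made for illustration, and a flat random dictionary stands in for the learned hierarchical one. The sketch only shows how one set of pose coefficients and a per-view rotation can jointly explain several 2D views.

```python
import numpy as np


def rotation_from_axis_angle(r):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3)
    k = r / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def canonical_pose(coeffs, dictionary):
    """Linear combination of pose bases: (K,) and (K, J, 3) -> (J, 3)."""
    return np.tensordot(coeffs, dictionary, axes=1)


def project(pose_3d, R):
    """Rotate the canonical pose into a camera view and drop the depth axis
    (orthographic projection, an assumption of this sketch)."""
    return (pose_3d @ R.T)[:, :2]


def consistency_loss(coeffs_a, coeffs_b):
    """Hypothetical multi-view term: two views of the same person at the same
    instant should yield the same canonical pose coefficients."""
    return float(np.mean((coeffs_a - coeffs_b) ** 2))


# Hypothetical sizes: 8 pose bases, 17 joints.
rng = np.random.default_rng(0)
num_bases, num_joints = 8, 17
dictionary = rng.normal(size=(num_bases, num_joints, 3))
coeffs = rng.normal(size=num_bases)

# One canonical pose, observed from two different camera viewpoints.
pose = canonical_pose(coeffs, dictionary)
R_view1 = rotation_from_axis_angle(np.array([0.0, 0.3, 0.0]))
R_view2 = rotation_from_axis_angle(np.array([0.0, -1.2, 0.1]))
pose_2d_view1 = project(pose, R_view1)
pose_2d_view2 = project(pose, R_view2)
```

In this toy setting the two 2D poses differ only because the rotations differ. A network that predicted different coefficients for the two views would be penalized by `consistency_loss`, which is one simple way to keep viewpoint information out of the shape channel.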

Acknowledgements

This work is supported by the Beijing Natural Science Foundation (No. 4222037, L181010) and the National Natural Science Foundation of China (No. 61972035).

Author information

Corresponding author

Correspondence to Kan Li.

Ethics declarations

Conflict of Interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Kan Li and Yang Li contributed equally to this work.

About this article

Cite this article

Ma, Z., Li, K. & Li, Y. Self-supervised method for 3D human pose estimation with consistent shape and viewpoint factorization. Appl Intell 53, 3864–3876 (2023). https://doi.org/10.1007/s10489-022-03714-x

  • DOI: https://doi.org/10.1007/s10489-022-03714-x
