Abstract
Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can have significant appearance variations across viewpoints that make the recognition tasks challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well studied in existing works. Here, we propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D poses due to projection and occlusion are difficult to represent through a deterministic mapping, and therefore we adopt a probabilistic formulation for our embedding space. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views than 3D pose estimation models. We also show that by training a simple temporal embedding model, we achieve superior performance on pose sequence retrieval and substantially reduce the embedding dimension compared with stacking frame-based embeddings, enabling efficient large-scale retrieval. Furthermore, to enable our embeddings to work with partially visible input, we investigate different keypoint occlusion augmentation strategies during training and demonstrate that these augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that using our embeddings without any additional training achieves competitive performance relative to other models specifically trained for each task.
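The probabilistic formulation summarized above can be made concrete with a small sketch. The snippet below is our own illustration, not the paper's implementation: it assumes each pose is embedded as a diagonal Gaussian and estimates a matching probability by Monte-Carlo sampling with a sigmoid of the negative squared distance, in the spirit of hedged instance embeddings (Oh et al. 2019); the function name, sampling scheme, and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def matching_probability(mu_a, sigma_a, mu_b, sigma_b, n_samples=1000):
    """Monte-Carlo estimate of the matching probability between two
    diagonal-Gaussian embeddings: draw samples from each distribution,
    then average sigmoid(-d^2) over the squared Euclidean distances d^2
    between sample pairs. Larger sigma (more input ambiguity) spreads
    the samples and lowers the match confidence."""
    z_a = rng.normal(mu_a, sigma_a, size=(n_samples, len(mu_a)))
    z_b = rng.normal(mu_b, sigma_b, size=(n_samples, len(mu_b)))
    sq_dist = np.sum((z_a - z_b) ** 2, axis=1)
    return float(np.mean(1.0 / (1.0 + np.exp(sq_dist))))  # sigmoid(-d^2)
```

Cross-view retrieval then reduces to ranking database embeddings by this probability against a query embedding; the uncertainty carried by sigma is what a deterministic point embedding cannot express.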
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., & Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org
Akhter, I., & Black, M. J. (2015). Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR.
Andriluka, M., Pishchulin, L., Gehler, P., & Schiele, B. (2014). 2D human pose estimation: New benchmark and state of the art analysis. In CVPR.
Bojchevski, A., & Günnemann, S. (2018). Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking. In ICLR.
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., & Shah, R. (1994). Signature verification using a “siamese” time delay neural network. In NeurIPS.
Cao, C., Zhang, Y., Zhang, C., & Lu, H. (2017). Body joint guided 3-D deep convolutional descriptors for action recognition. IEEE Transactions on Cybernetics, 48(3), 1095–1108.
Chen, C. H., & Ramanan, D. (2017). 3D human pose estimation = 2D pose estimation + matching. In CVPR.
Chen, C. H., Tyagi, A., Agrawal, A., Drover, D., Stojanov, S., & Rehg, J. M. (2019). Unsupervised 3D pose estimation with geometric self-supervision. In CVPR.
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In ICML.
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., & Sun, J. (2018). Cascaded pyramid network for multi-person pose estimation. In CVPR.
Cheng, Y., Yang, B., Wang, B., & Tan, R. T. (2020). 3D human pose estimation using spatio-temporal networks with explicit occlusion training. In AAAI.
Cheng, Y., Yang, B., Wang, B., Yan, W., & Tan, R. T. (2019). Occlusion-aware networks for 3D human pose estimation in video. In ICCV.
Chu, R., Sun, Y., Li, Y., Liu, Z., Zhang, C., & Wei, Y. (2019). Vehicle re-identification with viewpoint-aware metric learning. In ICCV.
Du, W., Wang, Y., & Qiao, Y. (2017). RPAN: An end-to-end recurrent pose-attention network for action recognition in videos. In ICCV.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., & Zisserman, A. (2019). Temporal cycle-consistency learning. In CVPR.
Garcia-Salguero, M., Gonzalez-Jimenez, J., & Moreno, F. A. (2019). Human 3D pose estimation with a tilting camera for social mobile robot interaction. Sensors, 19(22), 4943.
Gu, R., Wang, G., & Hwang, J. N. (2019). Efficient multi-person hierarchical 3D pose estimation for autonomous driving. In MIPR.
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In CVPR.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In ICCV.
He, X., Zhou, Y., Zhou, Z., Bai, S., & Bai, X. (2018). Triplet-center loss for multi-view 3D object retrieval. In CVPR.
Hermans, A., Beyer, L., & Leibe, B. (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
Ho, C. H., Morgado, P., Persekian, A., & Vasconcelos, N. (2019). PIEs: Pose invariant embeddings. In CVPR.
Hu, W., & Zhu, S. C. (2010). Learning a probabilistic model mixing 3D and 2D primitives for view invariant object recognition. In CVPR.
Huang, C., Loy, C. C., & Tang, X. (2016). Local similarity-aware deep feature embedding. In NeurIPS.
Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2013). Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1325–1339.
Iqbal, U., Garbade, M., & Gall, J. (2017). Pose for action-action for pose. In FG.
Iscen, A., Tolias, G., Avrithis, Y., & Chum, O. (2018). Mining on manifolds: Metric learning without labels. In CVPR.
Iskakov, K., Burkov, E., Lempitsky, V., & Malkov, Y. (2019). Learnable triangulation of human pose. In ICCV.
Jammalamadaka, N., Zisserman, A., Eichner, M., Ferrari, V., & Jawahar, C. (2012). Video retrieval by mimicking poses. In ACM MM.
Ji, X., & Liu, H. (2009). Advances in view-invariant human motion analysis: A review. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1), 13–24.
Ji, X., Liu, H., Li, Y., & Brown, D. (2008). Visual-based view-invariant human motion analysis: A review. In KES.
Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS.
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In ICLR.
Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In ICLR.
Kocabas, M., Karagoz, S., & Akbas, E. (2019). Self-supervised learning of 3D human pose using multi-view geometry. In CVPR.
LeCun, Y., Huang, F. J., & Bottou, L. (2004). Learning methods for generic object recognition with invariance to pose and lighting. In CVPR.
Li, J., Wong, Y., Zhao, Q., & Kankanhalli, M. (2018). Unsupervised learning of view-invariant action representations. In NeurIPS.
Li, S., Ke, L., Pratama, K., Tai, Y. W., Tang, C. K., & Cheng, K. T. (2020). Cascaded deep monocular 3D human pose estimation with evolutionary training data. In CVPR.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV.
Liu, J., Akhtar, N., & Ajmal, M. (2018). Viewpoint invariant action recognition using RGB-D videos. IEEE Access, 6, 70061–70071.
Liu, M., & Yuan, J. (2018). Recognizing human actions as the evolution of pose estimation maps. In CVPR.
Luvizon, D. C., Tabia, H., & Picard, D. (2020). Multi-task deep learning for real-time 3D human pose estimation and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
von Marcard, T., Henschel, R., Black, M., Rosenhahn, B., & Pons-Moll, G. (2018). Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In ECCV.
Martinez, J., Hossain, R., Romero, J., & Little, J. J. (2017). A simple yet effective baseline for 3D human pose estimation. In ICCV.
Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., & Theobalt, C. (2017). Monocular 3D human pose estimation in the wild using improved CNN supervision. In 3DV.
Misra, I., Zitnick, C. L., & Hebert, M. (2016). Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV.
Mori, G., Pantofaru, C., Kothari, N., Leung, T., Toderici, G., Toshev, A., & Yang, W. (2015). Pose embeddings: A deep architecture for learning to match human poses. arXiv preprint arXiv:1507.00302.
Nie, B. X., Xiong, C., & Zhu, S. C. (2015). Joint action recognition and pose estimation from video. In CVPR.
Oh, S.J., Murphy, K., Pan, J., Roth, J., Schroff, F., & Gallagher, A. (2019). Modeling uncertainty with hedged instance embedding. In ICLR.
Oh Song, H., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In CVPR.
Ong, E. J., Micilotta, A. S., Bowden, R., & Hilton, A. (2006). Viewpoint invariant exemplar-based 3D human tracking. Computer Vision and Image Understanding, 104(2–3), 178–189.
Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. In NeurIPS.
Papandreou, G., Zhu, T., Chen, L. C., Gidaris, S., Tompson, J., & Murphy, K. (2018). PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV.
Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., & Murphy, K. (2017). Towards accurate multi-person pose estimation in the wild. In CVPR.
Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. In BMVC.
Pavllo, D., Feichtenhofer, C., Grangier, D., & Auli, M. (2019). 3D human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR.
Qiu, H., Wang, C., Wang, J., Wang, N., & Zeng, W. (2019). Cross view fusion for 3D human pose estimation. In ICCV.
Rao, C., & Shah, M. (2001). View-invariance in action recognition. In CVPR.
Hossain, M. R. I., & Little, J. J. (2018). Exploiting temporal information for 3D human pose estimation. In ECCV.
Rhodin, H., Salzmann, M., & Fua, P. (2018). Unsupervised geometry-aware representation for 3D human pose estimation. In ECCV.
Rhodin, H., Spörri, J., Katircioglu, I., Constantin, V., Meyer, F., Müller, E., Salzmann, M., & Fua, P. (2018). Learning monocular 3D human pose estimation from multi-view images. In CVPR.
Ronchi, M. R., Kim, J. S., & Yue, Y. (2016). A rotation invariant latent factor model for moveme discovery from static poses. In ICDM.
Sárándi, I., Linder, T., Arras, K. O., & Leibe, B. (2018). Synthetic occlusion augmentation with volumetric heatmaps for the 2018 ECCV PoseTrack Challenge on 3D human pose estimation. arXiv preprint arXiv:1809.04987.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In CVPR.
Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., Levine, S., & Brain, G. (2018). Time-contrastive networks: Self-supervised learning from video. In ICRA.
Sun, J. J., Zhao, J., Chen, L. C., Schroff, F., Adam, H., & Liu, T. (2020). View-invariant probabilistic embedding for human pose. In ECCV.
Sun, X., Xiao, B., Wei, F., Liang, S., & Wei, Y. (2018). Integral human pose regression. In ECCV.
Tekin, B., Márquez-Neila, P., Salzmann, M., & Fua, P. (2017). Learning to fuse 2D and 3D image cues for monocular body pose estimation. In ICCV.
Tome, D., Toso, M., Agapito, L., & Russell, C. (2018). Rethinking pose in 3D: Multi-stage refinement and recovery for markerless motion capture. In 3DV.
Vilnis, L., & McCallum, A. (2015). Word representations via Gaussian embedding. In ICLR.
Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., & Wu, Y. (2014). Learning fine-grained image similarity with deep ranking. In CVPR.
Wohlhart, P., & Lepetit, V. (2015). Learning descriptors for object recognition and 3D pose estimation. In CVPR.
Wu, C. Y., Manmatha, R., Smola, A. J., & Krahenbuhl, P. (2017). Sampling matters in deep embedding learning. In ICCV.
Xia, L., Chen, C. C., & Aggarwal, J. K. (2012). View invariant human action recognition using histograms of 3D joints. In CVPRW.
Zeng, A., Sun, X., Huang, F., Liu, M., Xu, Q., & Lin, S. (2020). SRNet: Improving generalization in 3D human pose estimation with a split-and-recombine approach. In ECCV.
Zhang, W., Zhu, M., & Derpanis, K. G. (2013). From actemes to action: A strongly-supervised representation for detailed action understanding. In ICCV.
Zhao, L., Peng, X., Tian, Y., Kapadia, M., & Metaxas, D. N. (2019). Semantic graph convolutional networks for 3D human pose regression. In CVPR.
Zheng, L., Huang, Y., Lu, H., & Yang, Y. (2019). Pose invariant embedding for deep person re-identification. IEEE Transactions on Image Processing, 28, 4500–4509.
Zhou, X., Huang, Q., Sun, X., Xue, X., & Wei, Y. (2017). Towards 3D human pose estimation in the wild: A weakly-supervised approach. In ICCV.
Acknowledgements
We would like to thank Debidatta Dwibedi, Kree Cole-McLaughlin, and Andrew Gallagher from Google Research, as well as ** Zhao, Liangzhe Yuan, Yuxiao Wang, Liang-Chieh Chen, Florian Schroff, and Hartwig Adam.
Additional information
Communicated by Gregory Rogez.
Appendix
A Keypoint Definition
Figure 16 illustrates the keypoints that we use in our experiments. The 16 keypoints we use to define a 3D pose are shown in Fig. 16a. We map 3D keypoints from different datasets to these 16 keypoints for training and evaluation in this paper. Most of these mappings are unambiguous; the special cases that we note here are:
- For the Human3.6M dataset (Ionescu et al. 2013), we discard the “Neck / Nose” keypoint and map the “Thorax” keypoint to “Neck”.
- For the MPI-INF-3DHP dataset (Mehta et al. 2017), we discard the “Head top” keypoint.
- For the 3DPW dataset (von Marcard et al. 2018), we add a “Pelvis” keypoint as the center of “Left hip” and “Right hip”, and a “Spine” keypoint as the center of “Pelvis” and “Neck”.
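The derived 3DPW keypoints above are simple midpoints. As an illustration, a minimal Python sketch (our own, not the paper's code; the dictionary-based pose format and exact keypoint strings are assumptions) might look like:

```python
import numpy as np

def midpoint(a, b):
    """Midpoint of two 3D keypoints given as coordinate triples."""
    return (np.asarray(a, dtype=float) + np.asarray(b, dtype=float)) / 2.0

def add_derived_keypoints(pose):
    """Return a copy of a 3DPW-style pose dict with the 'Pelvis' and
    'Spine' keypoints added: 'Pelvis' is the midpoint of the hips, and
    'Spine' is the midpoint of 'Pelvis' and 'Neck'."""
    pose = dict(pose)  # avoid mutating the caller's dict
    pose["Pelvis"] = midpoint(pose["Left hip"], pose["Right hip"])
    pose["Spine"] = midpoint(pose["Pelvis"], pose["Neck"])
    return pose
```

Note that “Spine” depends on the freshly derived “Pelvis”, so the two must be computed in this order.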
The 13 2D keypoints we use to define a 2D pose are shown in Fig. 16b. We follow the COCO (Lin et al. 2014) keypoint definition, keeping all 12 body keypoints and the “Nose” keypoint on the head.
Cite this article
Liu, T., Sun, J.J., Zhao, L. et al. View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose. Int J Comput Vis 130, 111–135 (2022). https://doi.org/10.1007/s11263-021-01529-w