
View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose

Published in: International Journal of Computer Vision

Abstract

Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can exhibit significant appearance variation across viewpoints, making recognition challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well studied in existing work. We propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Because input ambiguities of 2D poses caused by projection and occlusion are difficult to represent through a deterministic mapping, we adopt a probabilistic formulation for our embedding space. Experimental results show that our embedding model achieves higher accuracy than 3D pose estimation models when retrieving similar poses across different camera views. We also show that by training a simple temporal embedding model, we achieve superior performance on pose sequence retrieval while substantially reducing the embedding dimension relative to stacking frame-based embeddings, enabling efficient large-scale retrieval. Furthermore, to enable our embeddings to work with partially visible input, we investigate different keypoint occlusion augmentation strategies during training and demonstrate that these augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that our embeddings, used without any additional training, achieve competitive performance relative to models trained specifically for each task.
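To make the idea above concrete, the sketch below is a minimal, illustrative Python example (not the authors' architecture or released code): it maps normalized 2D keypoints to a Gaussian embedding with a mean and a diagonal variance, applies one simple random keypoint-dropping augmentation to simulate partial visibility, and ranks retrieval candidates with a toy uncertainty-discounted distance. The dimensions, layer sizes, and scoring rule are assumptions made for illustration only.

```python
import numpy as np

# Illustrative sketch only: a probabilistic pose embedding maps normalized 2D
# keypoints to a Gaussian in embedding space, so that projection/occlusion
# ambiguity can be expressed as variance rather than a single point.
rng = np.random.default_rng(0)

NUM_KEYPOINTS_2D = 13   # 2D input joints (COCO-style subset; see Appendix A)
EMBED_DIM = 16          # compact embedding dimension (hypothetical value)
HIDDEN = 64             # hidden width of this toy encoder (hypothetical value)

# Toy one-hidden-layer encoder with separate heads for the Gaussian mean and
# the (diagonal) log-variance.
W1 = rng.normal(0.0, 0.1, (2 * NUM_KEYPOINTS_2D, HIDDEN))
W_mu = rng.normal(0.0, 0.1, (HIDDEN, EMBED_DIM))
W_logvar = rng.normal(0.0, 0.1, (HIDDEN, EMBED_DIM))

def embed(keypoints_2d: np.ndarray):
    """Map (13, 2) normalized keypoints to a Gaussian (mean, variance)."""
    h = np.tanh(keypoints_2d.reshape(-1) @ W1)
    return h @ W_mu, np.exp(h @ W_logvar)   # exp keeps the variance positive

def occlusion_augment(keypoints_2d: np.ndarray, drop_prob: float = 0.3):
    """Randomly zero out keypoints to simulate partially visible input
    (one simple augmentation strategy; the paper compares several)."""
    visible = rng.random(len(keypoints_2d)) > drop_prob
    return keypoints_2d * visible[:, None]

def retrieval_score(query, candidate):
    """Toy cross-view retrieval score: negative distance between means,
    discounted by the total uncertainty of the two embeddings."""
    (mu_q, var_q), (mu_c, var_c) = query, candidate
    return -np.linalg.norm(mu_q - mu_c) - 0.1 * (var_q.sum() + var_c.sum())

# Usage: embed an occluded query pose and rank two candidate poses.
query = embed(occlusion_augment(rng.random((NUM_KEYPOINTS_2D, 2))))
candidates = [embed(rng.random((NUM_KEYPOINTS_2D, 2))) for _ in range(2)]
best = max(range(len(candidates)),
           key=lambda i: retrieval_score(query, candidates[i]))
print("closest candidate:", best)
```

The variance head stands in for the paper's probabilistic treatment of ambiguity from projection and occlusion; the actual matching function and training objectives in the paper differ from this toy scoring rule.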




References

  • Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., & Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org

  • Akhter, I., & Black, M. J. (2015). Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR.

  • Andriluka, M., Pishchulin, L., Gehler, P., & Schiele, B. (2014). 2D human pose estimation: New benchmark and state of the art analysis. In CVPR.

  • Bojchevski, A., & Günnemann, S. (2018). Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking. In ICLR.

  • Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., & Shah, R. (1994). Signature verification using a “siamese” time delay neural network. In NeurIPS.

  • Cao, C., Zhang, Y., Zhang, C., & Lu, H. (2017). Body joint guided 3-D deep convolutional descriptors for action recognition. IEEE Transactions on Cybernetics, 48(3), 1095–1108.


  • Chen, C. H., & Ramanan, D. (2017). 3D human pose estimation = 2D pose estimation + matching. In CVPR.

  • Chen, C. H., Tyagi, A., Agrawal, A., Drover, D., Stojanov, S., & Rehg, J. M. (2019). Unsupervised 3D pose estimation with geometric self-supervision. In CVPR.

  • Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In ICML.

  • Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., & Sun, J. (2018). Cascaded pyramid network for multi-person pose estimation. In CVPR.

  • Cheng, Y., Yang, B., Wang, B., & Tan, R. T. (2020). 3D human pose estimation using spatio-temporal networks with explicit occlusion training. In AAAI.

  • Cheng, Y., Yang, B., Wang, B., Yan, W., & Tan, R. T. (2019). Occlusion-aware networks for 3D human pose estimation in video. In ICCV.

  • Chu, R., Sun, Y., Li, Y., Liu, Z., Zhang, C., & Wei, Y. (2019). Vehicle re-identification with viewpoint-aware metric learning. In ICCV.

  • Du, W., Wang, Y., & Qiao, Y. (2017). RPAN: An end-to-end recurrent pose-attention network for action recognition in videos. In ICCV.

  • Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.


  • Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., & Zisserman, A. (2019). Temporal cycle-consistency learning. In CVPR.

  • Garcia-Salguero, M., Gonzalez-Jimenez, J., & Moreno, F. A. (2019). Human 3D pose estimation with a tilting camera for social mobile robot interaction. Sensors, 19(22), 4943.


  • Gu, R., Wang, G., & Hwang, J. N. (2019). Efficient multi-person hierarchical 3D pose estimation for autonomous driving. In MIPR.

  • Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In CVPR.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In CVPR.

  • He, X., Zhou, Y., Zhou, Z., Bai, S., & Bai, X. (2018). Triplet-center loss for multi-view 3D object retrieval. In CVPR.

  • Hermans, A., Beyer, L., & Leibe, B. (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.

  • Ho, C. H., Morgado, P., Persekian, A., & Vasconcelos, N. (2019). PIEs: Pose invariant embeddings. In CVPR.

  • Hu, W., & Zhu, S. C. (2010). Learning a probabilistic model mixing 3D and 2D primitives for view invariant object recognition. In CVPR.

  • Huang, C., Loy, C. C., & Tang, X. (2016). Local similarity-aware deep feature embedding. In NeurIPS.

  • Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2013). Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1325–1339.


  • Iqbal, U., Garbade, M., & Gall, J. (2017). Pose for action-action for pose. In FG.

  • Iscen, A., Tolias, G., Avrithis, Y., & Chum, O. (2018). Mining on manifolds: Metric learning without labels. In CVPR.

  • Iskakov, K., Burkov, E., Lempitsky, V., & Malkov, Y. (2019). Learnable triangulation of human pose. In ICCV.

  • Jammalamadaka, N., Zisserman, A., Eichner, M., Ferrari, V., & Jawahar, C. (2012). Video retrieval by mimicking poses. In ACM MM.

  • Ji, X., & Liu, H. (2009). Advances in view-invariant human motion analysis: A review. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1), 13–24.


  • Ji, X., Liu, H., Li, Y., & Brown, D. (2008). Visual-based view-invariant human motion analysis: A review. In KES.

  • Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS.

  • Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.


  • Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In ICLR.

  • Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In ICLR.

  • Kocabas, M., Karagoz, S., & Akbas, E. (2019). Self-supervised learning of 3D human pose using multi-view geometry. In CVPR.

  • LeCun, Y., Huang, F. J., & Bottou, L. (2004). Learning methods for generic object recognition with invariance to pose and lighting. In CVPR.

  • Li, J., Wong, Y., Zhao, Q., & Kankanhalli, M. (2018). Unsupervised learning of view-invariant action representations. In NeurIPS.

  • Li, S., Ke, L., Pratama, K., Tai, Y. W., Tang, C. K., & Cheng, K. T. (2020). Cascaded deep monocular 3D human pose estimation with evolutionary training data. In CVPR.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV.

  • Liu, J., Akhtar, N., & Ajmal, M. (2018). Viewpoint invariant action recognition using RGB-D videos. IEEE Access, 6, 70061–70071.


  • Liu, M., & Yuan, J. (2018). Recognizing human actions as the evolution of pose estimation maps. In CVPR.

  • Luvizon, D. C., Tabia, H., & Picard, D. (2020). Multi-task deep learning for real-time 3D human pose estimation and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • von Marcard, T., Henschel, R., Black, M., Rosenhahn, B., & Pons-Moll, G. (2018). Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In ECCV.

  • Martinez, J., Hossain, R., Romero, J., & Little, J. J. (2017). A simple yet effective baseline for 3D human pose estimation. In ICCV.

  • Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., & Theobalt, C. (2017). Monocular 3D human pose estimation in the wild using improved CNN supervision. In 3DV.

  • Misra, I., Zitnick, C. L., & Hebert, M. (2016). Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV.

  • Mori, G., Pantofaru, C., Kothari, N., Leung, T., Toderici, G., Toshev, A., & Yang, W. (2015). Pose embeddings: A deep architecture for learning to match human poses. arXiv preprint arXiv:1507.00302.

  • Nie, B. X., Xiong, C., & Zhu, S. C. (2015). Joint action recognition and pose estimation from video. In CVPR.

  • Oh, S. J., Murphy, K., Pan, J., Roth, J., Schroff, F., & Gallagher, A. (2019). Modeling uncertainty with hedged instance embedding. In ICLR.

  • Oh Song, H., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In CVPR.

  • Ong, E. J., Micilotta, A. S., Bowden, R., & Hilton, A. (2006). Viewpoint invariant exemplar-based 3D human tracking. Computer Vision and Image Understanding, 104(2–3), 178–189.


  • Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. In NeurIPS.

  • Papandreou, G., Zhu, T., Chen, L. C., Gidaris, S., Tompson, J., & Murphy, K. (2018). PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV.

  • Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., & Murphy, K. (2017). Towards accurate multi-person pose estimation in the wild. In CVPR.

  • Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. In BMVC.

  • Pavllo, D., Feichtenhofer, C., Grangier, D., & Auli, M. (2019). 3D human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR.

  • Qiu, H., Wang, C., Wang, J., Wang, N., & Zeng, W. (2019). Cross view fusion for 3D human pose estimation. In ICCV.

  • Rao, C., & Shah, M. (2001). View-invariance in action recognition. In CVPR.

  • Rayat I. H. M., & Little, J. J. (2018). Exploiting temporal information for 3D human pose estimation. In ECCV.

  • Rhodin, H., Salzmann, M., & Fua, P. (2018). Unsupervised geometry-aware representation for 3D human pose estimation. In ECCV.

  • Rhodin, H., Spörri, J., Katircioglu, I., Constantin, V., Meyer, F., Müller, E., Salzmann, M., & Fua, P. (2018). Learning monocular 3D human pose estimation from multi-view images. In CVPR.

  • Ronchi, M. R., Kim, J. S., & Yue, Y. (2016). A rotation invariant latent factor model for moveme discovery from static poses. In ICDM.

  • Sárándi, I., Linder, T., Arras, K. O., & Leibe, B. (2018). Synthetic occlusion augmentation with volumetric heatmaps for the 2018 ECCV PoseTrack Challenge on 3D human pose estimation. arXiv preprint arXiv:1809.04987.

  • Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In CVPR.

  • Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., Levine, S., & Brain, G. (2018). Time-contrastive networks: Self-supervised learning from video. In ICRA.

  • Sun, J. J., Zhao, J., Chen, L. C., Schroff, F., Adam, H., & Liu, T. (2020). View-invariant probabilistic embedding for human pose. In ECCV.

  • Sun, X., Xiao, B., Wei, F., Liang, S., & Wei, Y. (2018). Integral human pose regression. In ECCV.

  • Tekin, B., Márquez-Neila, P., Salzmann, M., & Fua, P. (2017). Learning to fuse 2D and 3D image cues for monocular body pose estimation. In ICCV.

  • Tome, D., Toso, M., Agapito, L., & Russell, C. (2018). Rethinking pose in 3D: Multi-stage refinement and recovery for markerless motion capture. In 3DV.

  • Vilnis, L., & McCallum, A. (2015). Word representations via Gaussian embedding. In ICLR.

  • Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., & Wu, Y. (2014). Learning fine-grained image similarity with deep ranking. In CVPR.

  • Wohlhart, P., & Lepetit, V. (2015). Learning descriptors for object recognition and 3D pose estimation. In CVPR.

  • Wu, C. Y., Manmatha, R., Smola, A. J., & Krahenbuhl, P. (2017). Sampling matters in deep embedding learning. In ICCV.

  • Xia, L., Chen, C. C., & Aggarwal, J. K. (2012). View invariant human action recognition using histograms of 3D joints. In CVPRW.

  • Zeng, A., Sun, X., Huang, F., Liu, M., Xu, Q., & Lin, S. (2020). SRNet: Improving generalization in 3D human pose estimation with a split-and-recombine approach. In ECCV.

  • Zhang, W., Zhu, M., & Derpanis, K. G. (2013). From actemes to action: A strongly-supervised representation for detailed action understanding. In ICCV.

  • Zhao, L., Peng, X., Tian, Y., Kapadia, M., & Metaxas, D. N. (2019). Semantic graph convolutional networks for 3D human pose regression. In CVPR.

  • Zheng, L., Huang, Y., Lu, H., & Yang, Y. (2019). Pose invariant embedding for deep person re-identification. IEEE Transactions on Image Processing, 28, 4500–4509.


  • Zhou, X., Huang, Q., Sun, X., Xue, X., & Wei, Y. (2017). Towards 3D human pose estimation in the wild: A weakly-supervised approach. In ICCV.


Acknowledgements

We would like to thank Debidatta Dwibedi, Kree Cole-McLaughlin, and Andrew Gallagher from Google Research.

Authors

Ting Liu, Jennifer J. Sun, Long Zhao, Jiaping Zhao, Liangzhe Yuan, Yuxiao Wang, Liang-Chieh Chen, Florian Schroff & Hartwig Adam

Corresponding author

Correspondence to Ting Liu.

    Additional information

    Communicated by Gregory Rogez.

Appendix A: Keypoint Definition

Figure 16 illustrates the keypoints that we use in our experiments. The 16 keypoints we use to define a 3D pose are shown in Fig. 16a. We map 3D keypoints from different datasets to these 16 keypoints for training and evaluation in this paper. While most of these mappings are unambiguous, several special cases are worth noting:

  • For the Human3.6M dataset (Ionescu et al. 2013), we discard the “Neck/Nose” keypoint and map the “Thorax” keypoint to “Neck”.

  • For the MPI-INF-3DHP dataset (Mehta et al. 2017), we discard the “Head top” keypoint.

  • For the 3DPW dataset (von Marcard et al. 2018), we add the “Pelvis” keypoint as the midpoint of “Left hip” and “Right hip”, and the “Spine” keypoint as the midpoint of “Pelvis” and “Neck”.

The 13 keypoints we use to define a 2D pose are shown in Fig. 16b. We follow the COCO (Lin et al. 2014) keypoint definition, keeping all 12 body keypoints and the “Nose” keypoint on the head.
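For concreteness, here is a minimal Python sketch of the special-case mappings listed above. The joint names follow the text, but the dictionary-based representation, the exact key strings, and the toy coordinates are assumptions made purely for illustration; each dataset uses its own joint ordering.

```python
import numpy as np

def midpoint(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Midpoint of two 3D joint positions."""
    return (a + b) / 2.0

def map_h36m(joints: dict) -> dict:
    """Human3.6M: drop the "Neck/Nose" joint and rename "Thorax" to "Neck"."""
    mapped = {k: v for k, v in joints.items() if k != "Neck/Nose"}
    mapped["Neck"] = mapped.pop("Thorax")
    return mapped

def map_3dpw(joints: dict) -> dict:
    """3DPW: add "Pelvis" (midpoint of the hips) and "Spine" (midpoint of
    "Pelvis" and "Neck"); keep the other joints unchanged."""
    mapped = dict(joints)
    mapped["Pelvis"] = midpoint(joints["Left hip"], joints["Right hip"])
    mapped["Spine"] = midpoint(mapped["Pelvis"], joints["Neck"])
    return mapped

# Usage with made-up coordinates:
toy = {
    "Left hip": np.array([0.1, 0.0, 0.0]),
    "Right hip": np.array([-0.1, 0.0, 0.0]),
    "Neck": np.array([0.0, 0.5, 0.0]),
}
print(map_3dpw(toy)["Spine"])   # midpoint of "Pelvis" and "Neck"
```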


    Cite this article

    Liu, T., Sun, J.J., Zhao, L. et al. View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose. Int J Comput Vis 130, 111–135 (2022). https://doi.org/10.1007/s11263-021-01529-w

