Abstract
Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can have significant appearance variations across viewpoints that make the recognition tasks challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well studied in existing works. Here, we propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D poses due to projection and occlusion are difficult to represent through a deterministic mapping, and therefore we adopt a probabilistic formulation for our embedding space. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views than 3D pose estimation models. We also show that by training a simple temporal embedding model, we achieve superior performance on pose sequence retrieval and substantially reduce the embedding dimension compared with stacking frame-based embeddings, enabling efficient large-scale retrieval. Furthermore, to enable our embeddings to work with partially visible input, we investigate different keypoint occlusion augmentation strategies during training and demonstrate that these augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that using our embeddings without any additional training achieves competitive performance relative to other models specifically trained for each task.
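The probabilistic formulation summarized above can be made concrete with a small sketch. The snippet below is our own illustration, not the paper's implementation: it assumes each pose is embedded as a diagonal Gaussian and estimates a matching probability by Monte-Carlo sampling with a sigmoid of the negative squared distance, in the spirit of hedged instance embeddings (Oh et al. 2019); the function name, sampling scheme, and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def matching_probability(mu_a, sigma_a, mu_b, sigma_b, n_samples=1000):
    """Monte-Carlo estimate of the matching probability between two
    diagonal-Gaussian embeddings: draw samples from each distribution,
    then average sigmoid(-d^2) over the squared Euclidean distances d^2
    between sample pairs. Larger sigma (more input ambiguity) spreads
    the samples and lowers the match confidence."""
    z_a = rng.normal(mu_a, sigma_a, size=(n_samples, len(mu_a)))
    z_b = rng.normal(mu_b, sigma_b, size=(n_samples, len(mu_b)))
    sq_dist = np.sum((z_a - z_b) ** 2, axis=1)
    return float(np.mean(1.0 / (1.0 + np.exp(sq_dist))))  # sigmoid(-d^2)
```

Cross-view retrieval then reduces to ranking database embeddings by this probability against a query embedding; the uncertainty carried by sigma is what a deterministic point embedding cannot express.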
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., & Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org
Akhter, I., & Black, M. J. (2015). Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR.
Andriluka, M., Pishchulin, L., Gehler, P., & Schiele, B. (2014). 2D human pose estimation: New benchmark and state of the art analysis. In CVPR.
Bojchevski, A., & Günnemann, S. (2018). Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking. In ICLR.
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., & Shah, R. (1994). Signature verification using a “siamese” time delay neural network. In NeurIPS.
Cao, C., Zhang, Y., Zhang, C., & Lu, H. (2017). Body joint guided 3-D deep convolutional descriptors for action recognition. IEEE Transactions on Cybernetics, 48(3), 1095–1108.
Chen, C. H., & Ramanan, D. (2017). 3D human pose estimation = 2D pose estimation + matching. In CVPR.
Chen, C. H., Tyagi, A., Agrawal, A., Drover, D., Stojanov, S., & Rehg, J. M. (2019). Unsupervised 3D pose estimation with geometric self-supervision. In CVPR.
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In ICML.
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., & Sun, J. (2018). Cascaded pyramid network for multi-person pose estimation. In CVPR.
Cheng, Y., Yang, B., Wang, B., & Tan, R. T. (2020). 3D human pose estimation using spatio-temporal networks with explicit occlusion training. In AAAI.
Cheng, Y., Yang, B., Wang, B., Yan, W., & Tan, R. T. (2019). Occlusion-aware networks for 3D human pose estimation in video. In ICCV.
Chu, R., Sun, Y., Li, Y., Liu, Z., Zhang, C., & Wei, Y. (2019). Vehicle re-identification with viewpoint-aware metric learning. In ICCV.
Du, W., Wang, Y., & Qiao, Y. (2017). RPAN: An end-to-end recurrent pose-attention network for action recognition in videos. In ICCV.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., & Zisserman, A. (2019). Temporal cycle-consistency learning. In CVPR.
Garcia-Salguero, M., Gonzalez-Jimenez, J., & Moreno, F. A. (2019). Human 3D pose estimation with a tilting camera for social mobile robot interaction. Sensors, 19(22), 4943.
Gu, R., Wang, G., & Hwang, J. N. (2019). Efficient multi-person hierarchical 3D pose estimation for autonomous driving. In MIPR.
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In CVPR.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In ICCV.
He, X., Zhou, Y., Zhou, Z., Bai, S., & Bai, X. (2018). Triplet-center loss for multi-view 3D object retrieval. In CVPR.
Hermans, A., Beyer, L., & Leibe, B. (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
Ho, C. H., Morgado, P., Persekian, A., & Vasconcelos, N. (2019). PIEs: Pose invariant embeddings. In CVPR.
Hu, W., & Zhu, S. C. (2010). Learning a probabilistic model mixing 3D and 2D primitives for view invariant object recognition. In CVPR.
Huang, C., Loy, C. C., & Tang, X. (2016). Local similarity-aware deep feature embedding. In NeurIPS.
Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2013). Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1325–1339.
Iqbal, U., Garbade, M., & Gall, J. (2017). Pose for action-action for pose. In FG.
Iscen, A., Tolias, G., Avrithis, Y., & Chum, O. (2018). Mining on manifolds: Metric learning without labels. In CVPR.
Iskakov, K., Burkov, E., Lempitsky, V., & Malkov, Y. (2019). Learnable triangulation of human pose. In ICCV.
Jammalamadaka, N., Zisserman, A., Eichner, M., Ferrari, V., & Jawahar, C. (2012). Video retrieval by mimicking poses. In ACM MM.
Ji, X., & Liu, H. (2009). Advances in view-invariant human motion analysis: A review. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1), 13–24.
Ji, X., Liu, H., Li, Y., & Brown, D. (2008). Visual-based view-invariant human motion analysis: A review. In KES.
Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS.
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In ICLR.
Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In ICLR.
Kocabas, M., Karagoz, S., & Akbas, E. (2019). Self-supervised learning of 3D human pose using multi-view geometry. In CVPR.
LeCun, Y., Huang, F. J., & Bottou, L. (2004). Learning methods for generic object recognition with invariance to pose and lighting. In CVPR.
Li, J., Wong, Y., Zhao, Q., & Kankanhalli, M. (2018). Unsupervised learning of view-invariant action representations. In NeurIPS.
Li, S., Ke, L., Pratama, K., Tai, Y. W., Tang, C. K., & Cheng, K. T. (2020). Cascaded deep monocular 3D human pose estimation with evolutionary training data. In CVPR.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV.
Liu, J., Akhtar, N., & Ajmal, M. (2018). Viewpoint invariant action recognition using RGB-D videos. IEEE Access, 6, 70061–70071.
Liu, M., & Yuan, J. (2018). Recognizing human actions as the evolution of pose estimation maps. In CVPR.
Luvizon, D. C., Tabia, H., & Picard, D. (2020). Multi-task deep learning for real-time 3D human pose estimation and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
von Marcard, T., Henschel, R., Black, M., Rosenhahn, B., & Pons-Moll, G. (2018). Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In ECCV.
Martinez, J., Hossain, R., Romero, J., & Little, J. J. (2017). A simple yet effective baseline for 3D human pose estimation. In ICCV.
Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., & Theobalt, C. (2017). Monocular 3D human pose estimation in the wild using improved CNN supervision. In 3DV.
Misra, I., Zitnick, C. L., & Hebert, M. (2016). Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV.
Mori, G., Pantofaru, C., Kothari, N., Leung, T., Toderici, G., Toshev, A., & Yang, W. (2015). Pose embeddings: A deep architecture for learning to match human poses. arXiv preprint arXiv:1507.00302.
Nie, B. X., Xiong, C., & Zhu, S. C. (2015). Joint action recognition and pose estimation from video. In CVPR.
Oh, S.J., Murphy, K., Pan, J., Roth, J., Schroff, F., & Gallagher, A. (2019). Modeling uncertainty with hedged instance embedding. In ICLR.
Oh Song, H., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In CVPR.
Ong, E. J., Micilotta, A. S., Bowden, R., & Hilton, A. (2006). Viewpoint invariant exemplar-based 3D human tracking. Computer Vision and Image Understanding, 104(2–3), 178–189.
Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. In NeurIPS.
Papandreou, G., Zhu, T., Chen, L. C., Gidaris, S., Tompson, J., & Murphy, K. (2018). PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV.
Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., & Murphy, K. (2017). Towards accurate multi-person pose estimation in the wild. In CVPR.
Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. In BMVC.
Pavllo, D., Feichtenhofer, C., Grangier, D., & Auli, M. (2019). 3D human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR.
Qiu, H., Wang, C., Wang, J., Wang, N., & Zeng, W. (2019). Cross view fusion for 3D human pose estimation. In ICCV.
Rao, C., & Shah, M. (2001). View-invariance in action recognition. In CVPR.
Hossain, M. R. I., & Little, J. J. (2018). Exploiting temporal information for 3D human pose estimation. In ECCV.
Rhodin, H., Salzmann, M., & Fua, P. (2018). Unsupervised geometry-aware representation for 3D human pose estimation. In ECCV.
Rhodin, H., Spörri, J., Katircioglu, I., Constantin, V., Meyer, F., Müller, E., Salzmann, M., & Fua, P. (2018). Learning monocular 3D human pose estimation from multi-view images. In CVPR.
Ronchi, M. R., Kim, J. S., & Yue, Y. (2016). A rotation invariant latent factor model for moveme discovery from static poses. In ICDM.
Sárándi, I., Linder, T., Arras, K. O., & Leibe, B. (2018). Synthetic occlusion augmentation with volumetric heatmaps for the 2018 ECCV PoseTrack Challenge on 3D human pose estimation. arXiv preprint arXiv:1809.04987.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In CVPR.
Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., Levine, S., & Brain, G. (2018). Time-contrastive networks: Self-supervised learning from video. In ICRA.
Sun, J. J., Zhao, J., Chen, L. C., Schroff, F., Adam, H., & Liu, T. (2020). View-invariant probabilistic embedding for human pose. In ECCV.
Sun, X., Xiao, B., Wei, F., Liang, S., & Wei, Y. (2018). Integral human pose regression. In ECCV.
Tekin, B., Márquez-Neila, P., Salzmann, M., & Fua, P. (2017). Learning to fuse 2D and 3D image cues for monocular body pose estimation. In ICCV.
Tome, D., Toso, M., Agapito, L., & Russell, C. (2018). Rethinking pose in 3D: Multi-stage refinement and recovery for markerless motion capture. In 3DV.
Vilnis, L., & McCallum, A. (2015). Word representations via Gaussian embedding. In ICLR.
Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., & Wu, Y. (2014). Learning fine-grained image similarity with deep ranking. In CVPR.
Wohlhart, P., & Lepetit, V. (2015). Learning descriptors for object recognition and 3D pose estimation. In CVPR.
Wu, C. Y., Manmatha, R., Smola, A. J., & Krahenbuhl, P. (2017). Sampling matters in deep embedding learning. In ICCV.
Xia, L., Chen, C. C., & Aggarwal, J. K. (2012). View invariant human action recognition using histograms of 3D joints. In CVPRW.
Zeng, A., Sun, X., Huang, F., Liu, M., Xu, Q., & Lin, S. (2020). SRNet: Improving generalization in 3D human pose estimation with a split-and-recombine approach. In ECCV.
Zhang, W., Zhu, M., & Derpanis, K. G. (2013). From actemes to action: A strongly-supervised representation for detailed action understanding. In ICCV.
Zhao, L., Peng, X., Tian, Y., Kapadia, M., & Metaxas, D. N. (2019). Semantic graph convolutional networks for 3D human pose regression. In CVPR.
Zheng, L., Huang, Y., Lu, H., & Yang, Y. (2019). Pose invariant embedding for deep person re-identification. IEEE Transactions on Image Processing, 28, 4500–4509.
Zhou, X., Huang, Q., Sun, X., Xue, X., & Wei, Y. (2017). Towards 3D human pose estimation in the wild: A weakly-supervised approach. In ICCV.
Acknowledgements
We would like to thank Debidatta Dwibedi, Kree Cole-McLaughlin, and Andrew Gallagher from Google Research, as well as ** Zhao, Liangzhe Yuan, Yuxiao Wang, Liang-Chieh Chen, Florian Schroff, and Hartwig Adam.
Additional information
Communicated by Gregory Rogez.
Appendix
A Keypoint Definition
Figure 16 illustrates the keypoints that we use in our experiments. The 16 keypoints we use to define a 3D pose are shown in Fig. 16a. We map 3D keypoints from different datasets to these 16 keypoints for training and evaluation in this paper. Most of these mappings are unambiguous; the special cases that we note here are:
- For the Human3.6M dataset (Ionescu et al. 2013), we discard the “Neck / Nose” keypoint and map the “Thorax” keypoint to “Neck”.
- For the MPI-INF-3DHP dataset (Mehta et al. 2017), we discard the “Head top” keypoint.
- For the 3DPW dataset (von Marcard et al. 2018), we add a “Pelvis” keypoint as the center of “Left hip” and “Right hip”, and a “Spine” keypoint as the center of “Pelvis” and “Neck”.
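The derived 3DPW keypoints above are simple midpoints. As an illustration, a minimal Python sketch (our own, not the paper's code; the dictionary-based pose format and exact keypoint strings are assumptions) might look like:

```python
import numpy as np

def midpoint(a, b):
    """Midpoint of two 3D keypoints given as coordinate triples."""
    return (np.asarray(a, dtype=float) + np.asarray(b, dtype=float)) / 2.0

def add_derived_keypoints(pose):
    """Return a copy of a 3DPW-style pose dict with the 'Pelvis' and
    'Spine' keypoints added: 'Pelvis' is the midpoint of the hips, and
    'Spine' is the midpoint of 'Pelvis' and 'Neck'."""
    pose = dict(pose)  # avoid mutating the caller's dict
    pose["Pelvis"] = midpoint(pose["Left hip"], pose["Right hip"])
    pose["Spine"] = midpoint(pose["Pelvis"], pose["Neck"])
    return pose
```

Note that “Spine” depends on the freshly derived “Pelvis”, so the two must be computed in this order.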
The 13 2D keypoints we use to define a 2D pose are shown in Fig. 16b. We follow the COCO (Lin et al. 2014) keypoint definition, keeping all 12 body keypoints and the “Nose” keypoint on the head.
Cite this article
Liu, T., Sun, J.J., Zhao, L. et al. View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose. Int J Comput Vis 130, 111–135 (2022). https://doi.org/10.1007/s11263-021-01529-w