Single-Stage 3D Pose Estimation of Vulnerable Road Users Using Pseudo-Labels

Windbacher, Fabian; Hödlmoser, Michael; Gelautz, Margrit

doi:10.1007/978-3-031-31438-4_27


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13886)

Included in the following conference series:

Scandinavian Conference on Image Analysis


Abstract

Human pose estimation of vulnerable road users is an important perception task for autonomous vehicles, which can be exploited for intention prediction to guide the vehicle’s actions. Single-stage human pose estimation approaches, with their potential in terms of simplicity and efficiency, have shown only mediocre results in 2D and have hardly been investigated in 3D in the autonomous driving domain so far. We tackle this challenge with the 2D single-stage human pose estimator KAPAO. We find that KAPAO achieves state-of-the-art performance in our evaluation on domain-specific 2D benchmark datasets, which motivates its extension for application in 3D. To overcome a lack of ground-truth vulnerable road user data for 3D pose estimation, we first extend the Waymo Open Dataset with additional 3D pseudo-labels. We create more than one million 3D poses, which we estimate using the dataset’s exhaustive person bounding boxes and associated LiDAR point clouds. Evaluating their quality, we report a mean per joint position error of less than 10 cm. Having access to large-scale domain-specific 3D pose data, we propose a 3D variant of KAPAO that additionally predicts the depths of joints. We evaluate it on our extended Waymo Open Dataset and compare its performance to that of a LiDAR uplifting baseline. The proposed approach is low-latency and produces plausible poses but struggles to estimate absolute depth precisely, particularly at large distances. We alleviate that limitation by implementing a conditional LiDAR-based depth correction.
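
The mean per joint position error (MPJPE) mentioned in the abstract is the average Euclidean distance between predicted and reference 3D joint positions. The following is a minimal sketch of that metric, not the authors' evaluation code: the function name `mpjpe`, the `(poses, joints, 3)` array layout in metres, and the optional `valid` mask for unlabeled joints are all assumptions made for illustration.

```python
import numpy as np


def mpjpe(pred, gt, valid=None):
    """Mean per joint position error, in the unit of the inputs.

    pred, gt: arrays of shape (poses, joints, 3) holding 3D joint positions.
    valid:    optional boolean array of shape (poses, joints); joints marked
              False are excluded from the mean (hypothetical convention).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape or pred.shape[-1] != 3:
        raise ValueError("pred and gt must share the shape (poses, joints, 3)")

    # Euclidean distance per joint, shape (poses, joints).
    dist = np.linalg.norm(pred - gt, axis=-1)

    if valid is None:
        return float(dist.mean())

    valid = np.asarray(valid, dtype=bool)
    if not valid.any():
        raise ValueError("no valid joints to evaluate")
    return float(dist[valid].mean())


# Toy data: one pose with two joints, reference at the origin.
gt = np.zeros((1, 2, 3))
pred = np.array([[[0.03, 0.04, 0.0], [0.0, 0.0, 0.10]]])

error_m = mpjpe(pred, gt)  # joint errors are 0.05 m and 0.10 m
print(f"MPJPE: {error_m * 100:.1f} cm")
```

With positions in metres, the "less than 10 cm" criterion from the abstract corresponds to a return value below 0.10. Whether the original evaluation aligns poses or masks joints before averaging is not stated in the abstract, so the sketch applies no alignment.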



Acknowledgement

This work was partly supported by the SmartProtect project (no. 879642), which is funded through the Austrian Research Promotion Agency (FFG) on behalf of the Austrian Ministry of Climate Action (BMK) via its Mobility of the Future funding program, and the European Union’s H2020 Fast Track to Innovation project SmartRCS (no. 971619).

Author information

Authors and Affiliations

emotion3D, Rainergasse 1, 1040, Vienna, Austria
Fabian Windbacher & Michael Hödlmoser
TU Wien, Karlsplatz 13, 1040, Vienna, Austria
Margrit Gelautz


Corresponding author

Correspondence to Fabian Windbacher.

Editor information

Editors and Affiliations

Aalborg University, Aalborg, Denmark
Rikke Gade
Linköping University, Linköping, Sweden
Michael Felsberg
Tampere University, Tampere, Finland
Joni-Kristian Kämäräinen


Cite this paper

Windbacher, F., Hödlmoser, M., Gelautz, M. (2023). Single-Stage 3D Pose Estimation of Vulnerable Road Users Using Pseudo-Labels. In: Gade, R., Felsberg, M., Kämäräinen, JK. (eds) Image Analysis. SCIA 2023. Lecture Notes in Computer Science, vol 13886. Springer, Cham. https://doi.org/10.1007/978-3-031-31438-4_27


DOI: https://doi.org/10.1007/978-3-031-31438-4_27
Published: 27 April 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31437-7
Online ISBN: 978-3-031-31438-4
eBook Packages: Computer Science, Computer Science (R0)
