Single-Stage 3D Pose Estimation of Vulnerable Road Users Using Pseudo-Labels

Image Analysis (SCIA 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13886)


Abstract

Human pose estimation of vulnerable road users is an important perception task for autonomous vehicles, since the estimated poses can be exploited for intention prediction to guide the vehicle’s actions. Single-stage human pose estimation approaches, despite their potential in terms of simplicity and efficiency, have shown only mediocre results in 2D and have hardly been investigated in 3D in the autonomous driving domain so far. We tackle this challenge with the 2D single-stage human pose estimator KAPAO. We find that KAPAO achieves state-of-the-art performance in our evaluation on domain-specific 2D benchmark datasets, which motivates extending it to 3D. To overcome the lack of ground-truth 3D pose data for vulnerable road users, we first extend the Waymo Open Dataset with additional 3D pseudo-labels. We create more than one million 3D poses, which we estimate using the dataset’s exhaustive person bounding boxes and the associated LiDAR point clouds. Evaluating their quality, we report a mean per joint position error of less than 10 cm. With access to this large-scale domain-specific 3D pose data, we propose a 3D variant of KAPAO that additionally predicts the depths of joints. We evaluate it on our extended Waymo Open Dataset and compare its performance to that of a LiDAR uplifting baseline. The proposed approach is low-latency and produces plausible poses but struggles to estimate absolute depth precisely, particularly at large distances. We alleviate this limitation by implementing a conditional LiDAR-based depth correction.
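The mean per joint position error (MPJPE) quoted in the abstract is commonly defined as the average Euclidean distance between corresponding predicted and reference joints. The following is a minimal sketch of that common definition, not the authors' evaluation code. The function name `mpjpe`, the type alias `Joint`, and the assumption that joints are given as (x, y, z) tuples in metres are illustrative choices that do not come from the paper.

```python
import math
from typing import Sequence, Tuple

# Hypothetical representation: one joint as (x, y, z) in metres.
Joint = Tuple[float, float, float]


def mpjpe(pred: Sequence[Joint], ref: Sequence[Joint]) -> float:
    """Mean per joint position error: the average Euclidean distance
    between each predicted joint and its reference joint."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("pred and ref must be non-empty and of equal length")
    # math.dist computes the Euclidean distance between two points.
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)


# Toy example with two joints (values are made up for illustration).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.1), (1.0, 0.06, 0.08)]
error_m = mpjpe(predicted, reference)
print(f"MPJPE: {error_m * 100:.1f} cm")
```

In this toy case both joints are off by about 10 cm, so the mean is also about 10 cm. Evaluation protocols differ in details such as root alignment or handling of occluded joints, which this sketch does not attempt to reproduce.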



Acknowledgement

This work was partly supported by the SmartProtect project (no. 879642), which is funded through the Austrian Research Promotion Agency (FFG) on behalf of the Austrian Ministry of Climate Action (BMK) via its Mobility of the Future funding program, and the European Union’s H2020 Fast Track to Innovation project SmartRCS (no. 971619).

Author information

Corresponding author

Correspondence to Fabian Windbacher.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Windbacher, F., Hödlmoser, M., Gelautz, M. (2023). Single-Stage 3D Pose Estimation of Vulnerable Road Users Using Pseudo-Labels. In: Gade, R., Felsberg, M., Kämäräinen, JK. (eds) Image Analysis. SCIA 2023. Lecture Notes in Computer Science, vol 13886. Springer, Cham. https://doi.org/10.1007/978-3-031-31438-4_27

  • DOI: https://doi.org/10.1007/978-3-031-31438-4_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-31437-7

  • Online ISBN: 978-3-031-31438-4

  • eBook Packages: Computer Science, Computer Science (R0)
