Learning Ego 3D Representation as Ray Tracing

Conference paper in Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13686)

Abstract

A self-driving perception model aims to extract 3D semantic representations from multiple cameras collectively into the bird’s-eye-view (BEV) coordinate frame of the ego car in order to ground the downstream planner. Existing perception methods often rely on error-prone depth estimation of the whole scene, or on learning sparse virtual 3D representations that lack the target geometric structure; both remain limited in performance and/or capability. In this paper, we present a novel end-to-end architecture for ego 3D representation learning from an arbitrary number of unconstrained camera views. Inspired by the ray tracing principle, we design a polarized grid of “imaginary eyes” as the learnable ego 3D representation, and formulate the learning process with an adaptive attention mechanism in conjunction with 3D-to-2D projection. Critically, this formulation allows extracting a rich 3D representation from 2D images without any depth supervision, and with a built-in geometric structure consistent w.r.t. the BEV frame. Despite its simplicity and versatility, extensive experiments on standard BEV visual tasks (e.g., camera-based 3D object detection and BEV segmentation) show that our model outperforms all state-of-the-art alternatives significantly, with an extra advantage in computational efficiency from multi-task learning.
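To make the formulation above concrete, here is a minimal sketch (not the authors' code) of the kind of pipeline the abstract describes: build a polarized grid of 3D “imaginary eye” positions in the ego frame, project them into a camera via a 3D-to-2D projection with known intrinsics and extrinsics, bilinearly sample that camera's 2D features at the projected pixels, and fuse the per-view samples with attention. All function names, tensor shapes, and the simple softmax-over-views attention are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumptions, not the authors' code): a polarized BEV grid of
# "imaginary eyes", 3D-to-2D projection, feature sampling, and view attention.
import torch
import torch.nn.functional as F


def polar_bev_grid(n_radii=32, n_angles=64, max_range=50.0, height=0.0):
    """Polarized grid of 3D 'imaginary eye' positions on the ego ground plane."""
    r = torch.linspace(1.0, max_range, n_radii)
    theta = torch.linspace(-torch.pi, torch.pi, n_angles)
    rr, tt = torch.meshgrid(r, theta, indexing="ij")
    xyz = torch.stack(
        [rr * torch.cos(tt), rr * torch.sin(tt), torch.full_like(rr, height)],
        dim=-1,
    )
    return xyz.reshape(-1, 3)  # (N, 3) points in the ego frame


def project_to_camera(xyz, K, T_cam_from_ego):
    """3D-to-2D projection of ego-frame points into one camera's pixel plane."""
    pts_h = torch.cat([xyz, torch.ones(len(xyz), 1)], dim=-1)  # homogeneous
    cam = (T_cam_from_ego @ pts_h.T).T[:, :3]                  # camera frame
    in_front = cam[:, 2] > 1e-3                                # points the camera sees
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-3)                # perspective divide
    return uv, in_front


def sample_features(feat_map, uv, in_front):
    """Bilinearly sample a (C, H, W) feature map at projected pixel locations."""
    H, W = feat_map.shape[-2:]
    grid = torch.stack(
        [2.0 * uv[:, 0] / (W - 1) - 1.0, 2.0 * uv[:, 1] / (H - 1) - 1.0], dim=-1
    )
    out = F.grid_sample(
        feat_map[None], grid[None, :, None], align_corners=True
    )                                                          # (1, C, N, 1)
    feats = out[0, :, :, 0].T                                  # (N, C)
    return feats * in_front[:, None]                           # zero invalid rays


def aggregate_views(per_view_feats, query):
    """Fuse per-camera samples: each grid point attends over the V views."""
    # per_view_feats: (V, N, C); query: (N, C) learnable grid embeddings
    logits = torch.einsum("vnc,nc->nv", per_view_feats, query)  # (N, V)
    attn = logits.softmax(dim=-1)
    return torch.einsum("nv,vnc->nc", attn, per_view_feats)     # (N, C)
```

For a surround-camera rig, one would run project_to_camera and sample_features once per view, stack the results into per_view_feats, and let aggregate_views decide which cameras each grid point trusts. Since every step is differentiable, gradients flow back to the image backbone without any depth supervision, matching the property claimed in the abstract.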

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 6210020439), the Lingang Laboratory (Grant No. LG-QS-202202-07), and the Natural Science Foundation of Shanghai (Grant No. 22ZR1407500).

Author information

Correspondence to Li Zhang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6322 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lu, J., Zhou, Z., Zhu, X., Xu, H., Zhang, L. (2022). Learning Ego 3D Representation as Ray Tracing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13686. Springer, Cham. https://doi.org/10.1007/978-3-031-19809-0_8

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19809-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19808-3

  • Online ISBN: 978-3-031-19809-0

  • eBook Packages: Computer Science, Computer Science (R0)
