
Bird’s-Eye View Semantic Segmentation and Voxel Semantic Segmentation Based on Frustum Voxel Modeling and Monocular Camera


  • Original Paper
  • Journal of Shanghai Jiaotong University (Science)

Abstract

The semantic segmentation of a bird's-eye view (BEV) is crucial for environment perception in autonomous driving, as it covers both static elements of the scene, such as drivable areas, and dynamic elements, such as cars. This paper proposes an end-to-end deep learning architecture based on 3D convolution that predicts both BEV semantic segmentation and voxel semantic segmentation from monocular images. Voxelization of the scene and the transformation of features from perspective space to camera space are the key techniques the model uses to boost prediction accuracy. The effectiveness of the proposed method was demonstrated by training and evaluating the model on the nuScenes dataset. A comparison with state-of-the-art methods showed that the proposed approach outperformed them in BEV semantic segmentation; it also provides voxel semantic segmentation, which those methods cannot produce.
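The pipeline the abstract describes, lifting monocular image features into a camera-frame voxel (frustum) grid and applying 3D convolutions to obtain voxel and BEV segmentation, can be illustrated with a rough sketch. The PyTorch code below is a hypothetical reconstruction, not the authors' implementation: the pinhole projection, bilinear sampling, grid layout, layer sizes, and the max-over-height BEV reduction are all assumptions made for illustration.

    # Minimal sketch (not the paper's exact design): project each voxel centre of a
    # camera-frame grid into the image with the camera intrinsics, sample 2D features
    # by bilinear interpolation, then run a small 3D CNN head for per-voxel classes.
    # Reducing over one grid axis yields a BEV map. All shapes are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    def lift_to_voxels(feat_2d, K, grid_xyz):
        """feat_2d: (B, C, Hf, Wf) image features; K: (B, 3, 3) intrinsics rescaled to
        the feature map; grid_xyz: (X, Y, Z, 3) voxel centres in the camera frame (m)."""
        B, C, Hf, Wf = feat_2d.shape
        X, Y, Z, _ = grid_xyz.shape
        pts = grid_xyz.reshape(1, -1, 3).expand(B, -1, -1)         # (B, N, 3)
        uvw = pts @ K.transpose(1, 2)                              # pinhole projection
        uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-5)          # pixel coordinates
        # normalise to [-1, 1] for grid_sample
        u = uv[..., 0] / (Wf - 1) * 2 - 1
        v = uv[..., 1] / (Hf - 1) * 2 - 1
        grid = torch.stack([u, v], dim=-1).view(B, 1, -1, 2)       # (B, 1, N, 2)
        sampled = F.grid_sample(feat_2d, grid, align_corners=True) # (B, C, 1, N)
        vox = sampled.view(B, C, X, Y, Z)
        # zero out voxels that project behind the camera or outside the image
        valid = (uvw[..., 2] > 0) & (u.abs() <= 1) & (v.abs() <= 1)
        return vox * valid.view(B, 1, X, Y, Z)


    class VoxelSegHead(nn.Module):
        """Illustrative 3D-convolutional head: per-voxel class logits, plus a BEV map
        obtained by reducing over the grid axis treated here as height (an assumption)."""
        def __init__(self, in_ch=64, n_classes=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(32, n_classes, 1),
            )

        def forward(self, vox):
            logits = self.net(vox)        # (B, n_classes, X, Y, Z) voxel logits
            bev = logits.amax(dim=-1)     # (B, n_classes, X, Y) BEV logits
            return logits, bev

A full model would also learn the 2D feature extractor jointly and resolve the depth ambiguity along each camera ray; the sketch only shows the geometric lifting and 3D prediction steps named in the abstract.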




Funding

Foundation item: the National Natural Science Foundation of China (No. 52072243), and the Sichuan Science and Technology Program (No. 2020YFSY0058)

Author information

Corresponding author

Correspondence to Chengliang Yin  (殷承良).


About this article


Cite this article

Qin, C., Wang, Y., Zhang, Y. et al. Bird’s-Eye View Semantic Segmentation and Voxel Semantic Segmentation Based on Frustum Voxel Modeling and Monocular Camera. J. Shanghai Jiaotong Univ. (Sci.) 28, 100–113 (2023). https://doi.org/10.1007/s12204-023-2573-3

