Atlas: End-to-End 3D Scene Reconstruction from Posed Images

Murez, Zak; van As, Tarrence; Bartolozzi, James; Sinha, Ayan; Badrinarayanan, Vijay; Rabinovich, Andrew

doi:10.1007/978-3-030-58571-6_25

Zak Murez¹²,
Tarrence van As¹³,
James Bartolozzi¹²,
Ayan Sinha¹²,
Vijay Badrinarayanan¹⁴ &
…
Andrew Rabinovich¹³

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12352))

Included in the following conference series:

European Conference on Computer Vision

4997 Accesses

Abstract

We present an end-to-end 3D reconstruction method for a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images. Traditional approaches to 3D reconstruction rely on an intermediate representation of depth maps prior to estimating a full 3D model of a scene. We hypothesize that a direct regression to 3D is more effective. A 2D CNN extracts features from each image independently which are then back-projected and accumulated into a voxel volume using the camera intrinsics and extrinsics. After accumulation, a 3D CNN refines the accumulated features and predicts the TSDF values. Additionally, semantic segmentation of the 3D model is obtained without significant computation. This approach is evaluated on the Scannet dataset where we significantly outperform state-of-the-art baselines (deep multiview stereo followed by traditional TSDF fusion) both quantitatively and qualitatively. We compare our 3D semantic segmentation to prior methods that use a depth sensor since no previous work attempts the problem with only RGB input.

T. van As, J. Bartolozzi and V. Badrinarayanan—Work done at Magic Leap.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

EUR 32.99 /Month

Get 10 units per month
Download Article/Chapter or Ebook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Subscribe now

Buy Now

Chapter: EUR 29.95; Price includes VAT (Germany)

eBook: EUR 85.59; Price includes VAT (Germany)

Softcover Book: EUR 106.99; Price includes VAT (Germany)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Image-to-Voxel Model Translation with Conditional Adversarial Networks

3DMV: Joint 3D-Multi-view Prediction for 3D Semantic Scene Segmentation

SimpleRecon: 3D Reconstruction Without 3D Convolutions

References

Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: a deep convolutional encoder-decoder architecture for image segmentation (2015)
Google Scholar
Chabra, R., et al.: Deep local shapes: learning local SDF priors for detailed 3D reconstruction. ar**v preprint ar**v:2003.10983 (2020)
Chabra, R., Straub, J., Sweeney, C., Newcombe, R., Fuchs, H.: Stereodrnet: dilated residual stereonet. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11786–11795 (2019)
Google Scholar
Chang, A.X., et al.: Shapenet: an information-rich 3D model repository. ar**v preprint ar**v:1512.03012 (2015)
Chang, J.R., Chen, Y.S.: Pyramid stereo matching network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418 (2018)
Google Scholar
Chen, R., Han, S., Xu, J., Su, H.: Point-based multi-view stereo network. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1538–1547 (2019)
Google Scholar
Cheng, B., et al.: Panoptic-deeplab. ar**v preprint ar**v:1910.04751 (2019)
Choy, C., Gwak, J., Savarese, S.: 4D spatio-temporal convnets: minkowski convolutional neural networks (2019)
Google Scholar
Choy, C.B., Xu, D., Gwak, J.Y., Chen, K., Savarese, S.: 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 628–644. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_38
Chapter Google Scholar
Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 303–312 (1996)
Google Scholar
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). IEEE (2017)
Google Scholar
Dai, A., Diller, C., Nießner, M.: SG-NN: sparse generative neural networks for self-supervised scene completion of RGB-D scans. ar**v preprint ar**v:1912.00036 (2019)
Dai, A., Nießner, M.: 3DMV: joint 3D-multi-view prediction for 3D semantic scene segmentation (2018)
Google Scholar
Dai, A., Qi, C.R., Nießner, M.: Shape completion using 3D-encoder-predictor CNNs and shape synthesis (2016)
Google Scholar
Dai, A., Ritchie, D., Bokeloh, M., Reed, S., Sturm, J., Nießner, M.: Scancomplete: large-scale scene completion and semantic segmentation for 3D scans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4578–4587 (2018)
Google Scholar
Falcon, W.: Pytorch lightning. GitHub. https://github.com/PyTorchLightning/pytorch-lightning Cited by 3 (2019)
Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 605–613 (2017)
Google Scholar
Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2002–2011 (2018)
Google Scholar
Galliani, S., Lasinger, K., Schindler, K.: Massively parallel multiview stereopsis by surface normal diffusion. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 873–881 (2015)
Google Scholar
Gkioxari, G., Malik, J., Johnson, J.: Mesh R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9785–9795 (2019)
Google Scholar
Graham, B., Engelcke, M., van der Maaten, L.: 3D semantic segmentation with submanifold sparse convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9224–9232 (2018)
Google Scholar
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Google Scholar
Hirschmuller, H.: Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 328–341 (2007)
Article Google Scholar
Hou, J., Dai, A., Nießner, M.: 3D-SIC: 3D semantic instance completion for RGB-D scans. ar**v preprint ar**v:1904.12012 (2019)
Hou, J., Dai, A., Nießner, M.: 3D-SIS: 3D semantic instance segmentation of RGB-D scans (2018)
Google Scholar
Hou, Y., Kannala, J., Solin, A.: Multi-view stereo by temporal nonparametric fusion. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2651–2660 (2019)
Google Scholar
Huang, P.H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.B.: Deepmvs: learning multi-view stereopsis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2821–2830 (2018)
Google Scholar
Im, S., Jeon, H.G., Lin, S., Kweon, I.S.: Dpsnet: end-to-end deep plane sweep stereo. In: 7th International Conference on Learning Representations, ICLR 2019. International Conference on Learning Representations, ICLR (2019)
Google Scholar
Iskakov, K., Burkov, E., Lempitsky, V., Malkov, Y.: Learnable triangulation of human pose. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 7718–7727 (2019)
Google Scholar
Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6399–6408 (2019)
Google Scholar
Lasinger, K., Ranftl, R., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. ar**v preprint ar**v:1907.01341 (2019)
Lee, J.H., Han, M.K., Ko, D.W., Suh, I.H.: From big to small: multi-scale local planar guidance for monocular depth estimation. ar**v preprint ar**v:1907.10326 (2019)
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
Google Scholar
Lorensen, W.E., Cline, H.E.: Marching cubes: a high resolution 3D surface construction algorithm. ACM Siggraph Comput. Graph. 21(4), 163–169 (1987)
Article Google Scholar
McCormac, J., Clark, R., Bloesch, M., Davison, A., Leutenegger, S.: Fusion++: volumetric object-level slam. In: 2018 International Conference on 3D Vision (3DV), pp. 32–41. IEEE (2018)
Google Scholar
McCormac, J., Handa, A., Davison, A., Leutenegger, S.: Semanticfusion: dense 3D semantic map** with convolutional neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4628–4635. IEEE (2017)
Google Scholar
Narita, G., Seno, T., Ishikawa, T., Kaji, Y.: Panopticfusion: online volumetric semantic map** at the level of stuff and things. ar**v preprint ar**v:1903.01177 (2019)
Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: Deepsdf: learning continuous signed distance functions for shape representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 165–174 (2019)
Google Scholar
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical feature learning on point sets in a metric space (2017)
Google Scholar
Riegler, G., Osman Ulusoy, A., Geiger, A.: Octnet: learning deep 3D representations at high resolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3577–3586 (2017)
Google Scholar
Rosinol, A., Abate, M., Chang, Y., Carlone, L.: Kimera: an open-source library for real-time metric-semantic localization and map**. In: IEEE International Conference on Robotics and Automation (ICRA) (2020)
Google Scholar
Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Google Scholar
Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M.: Pixelwise view selection for unstructured multi-view stereo. In: European Conference on Computer Vision (ECCV) (2016)
Google Scholar
Schöps, T., Sattler, T., Pollefeys, M.: Surfelmeshing: online surfel-based mesh reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. (2019)
Google Scholar
Sinha, A., Unmesh, A., Huang, Q., Ramani, K.: Surfnet: generating 3D shape surfaces using deep residual networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6040–6049 (2017)
Google Scholar
Sitzmann, V., Thies, J., Heide, F., Nießner, M., Wetzstein, G., Zollhofer, M.: Deepvoxels: learning persistent 3D feature embeddings. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2437–2446 (2019)
Google Scholar
Sitzmann, V., Zollhöfer, M., Wetzstein, G.: Scene representation networks: continuous 3D-structure-aware neural scene representations. In: Advances in Neural Information Processing Systems (2019)
Google Scholar
Su, H., et al.: Splatnet: sparse lattice networks for point cloud processing (2018)
Google Scholar
Tatarchenko, M., Dosovitskiy, A., Brox, T.: Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2088–2096 (2017)
Google Scholar
Tateno, K., Tombari, F., Laina, I., Navab, N.: CNN-slam: real-time dense monocular slam with learned depth prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6243–6252 (2017)
Google Scholar
Wang, K., Shen, S.: Mvdepthnet: real-time multiview depth estimation neural network. In: 2018 International Conference on 3D Vision (3DV), pp. 248–257. IEEE (2018)
Google Scholar
Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G.: Pixel2Mesh: generating 3D mesh models from single RGB images. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–67 (2018)
Google Scholar
Wang, W., Yu, R., Huang, Q., Neumann, U.: SGPN: similarity group proposal network for 3D point cloud instance segmentation. In: In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2569–2578 (2018)
Google Scholar
Weder, S., Schönberger, J., Pollefeys, M., Oswald, M.R.: Routedfusion: learning real-time depth map fusion. ar**v preprint ar**v:2001.04388 (2020)
Whelan, T., Leutenegger, S., Salas-Moreno, R., Glocker, B., Davison, A.: Elasticfusion: Dense slam without a pose graph. Robotics: Science and Systems (2015)
Google Scholar
**e, H., Yao, H., Sun, X., Zhou, S., Zhang, S.: Pix2Vox: context-aware 3D reconstruction from single and multi-view images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2690–2698 (2019)
Google Scholar
Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: depth inference for unstructured multi-view stereo. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783 (2018)
Google Scholar
Yao, Y., Luo, Z., Li, S., Shen, T., Fang, T., Quan, L.: Recurrent MVSNet for high-resolution multi-view stereo depth inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5534 (2019)
Google Scholar
Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: FreiHAND: a dataset for markerless capture of hand pose and shape from single RGB images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 813–822 (2019)
Google Scholar

Download references

Author information

Authors and Affiliations

Magic Leap Inc., Sunnyvale, CA, USA
Zak Murez, James Bartolozzi & Ayan Sinha
InsideIQ Inc., San Francisco, CA, USA
Tarrence van As & Andrew Rabinovich
Wayve.ai, London, UK
Vijay Badrinarayanan

Authors

Zak Murez
View author publications
You can also search for this author in PubMed Google Scholar
Tarrence van As
View author publications
You can also search for this author in PubMed Google Scholar
James Bartolozzi
View author publications
You can also search for this author in PubMed Google Scholar
Ayan Sinha
View author publications
You can also search for this author in PubMed Google Scholar
Vijay Badrinarayanan
View author publications
You can also search for this author in PubMed Google Scholar
Andrew Rabinovich
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Zak Murez .

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (m4v 84704 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Murez, Z., van As, T., Bartolozzi, J., Sinha, A., Badrinarayanan, V., Rabinovich, A. (2020). Atlas: End-to-End 3D Scene Reconstruction from Posed Images. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12352. Springer, Cham. https://doi.org/10.1007/978-3-030-58571-6_25

Download citation

DOI: https://doi.org/10.1007/978-3-030-58571-6_25
Published: 09 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58570-9
Online ISBN: 978-3-030-58571-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Atlas: End-to-End 3D Scene Reconstruction from Posed Images

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Image-to-Voxel Model Translation with Conditional Adversarial Networks

3DMV: Joint 3D-Multi-view Prediction for 3D Semantic Scene Segmentation

SimpleRecon: 3D Reconstruction Without 3D Convolutions

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Atlas: End-to-End 3D Scene Reconstruction from Posed Images

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Image-to-Voxel Model Translation with Conditional Adversarial Networks

3DMV: Joint 3D-Multi-view Prediction for 3D Semantic Scene Segmentation

SimpleRecon: 3D Reconstruction Without 3D Convolutions

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation