Towards Robust Monocular Depth Estimation: A New Baseline and Benchmark

Published in: International Journal of Computer Vision

Abstract

Before deploying a monocular depth estimation (MDE) model in real-world applications such as autonomous driving, it is critical to understand its generalization and robustness. Although the generalization of MDE models has been thoroughly studied, their robustness has been overlooked in previous research. Existing state-of-the-art methods exhibit strong generalization to clean, unseen scenes. Such methods, however, degrade when the test image is perturbed. This is likely because prior art typically relies on basic 2D data augmentations (e.g., random horizontal flipping, random cropping, and color jittering), ignoring other common image degradations and corruptions. To mitigate this issue, we delve deeper into data augmentation and propose strong data augmentation techniques for robust depth estimation. In particular, we introduce a 3D-aware defocus blur in addition to seven 2D data augmentations. We evaluate the generalization of our model on six clean RGB-D datasets that were not seen during training. To evaluate the robustness of MDE models, we create a benchmark by applying 15 common corruptions to the clean images from IBIMS, NYUDv2, KITTI, ETH3D, DIODE, and TUM. On this benchmark, we systematically study the robustness of our method and 9 representative MDE models. The experimental results demonstrate that our model exhibits better generalization and robustness than previous methods. Specifically, we provide insights into the choice of data augmentation strategies and network architectures that should be useful for future research in robust monocular depth estimation. Our code, model, and benchmark are available at https://github.com/KexianHust/Robust-MonoDepth.
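As a concrete illustration of a strong 2D augmentation stack, the sketch below builds one with kornia, the augmentation library referenced in the paper's notes. It is a minimal sketch, assuming kornia as the backend; the specific transforms, magnitudes, and probabilities are illustrative assumptions rather than the authors' exact pipeline, and the 3D-aware defocus blur is omitted.

```python
# Minimal sketch of a "strong" 2D augmentation pipeline with kornia.
# The chosen transforms and parameters are assumptions for illustration only.
import torch
import kornia.augmentation as K

strong_aug = K.AugmentationSequential(
    K.RandomHorizontalFlip(p=0.5),                      # basic geometric augmentation
    K.ColorJitter(0.2, 0.2, 0.2, 0.1, p=0.8),           # basic photometric jitter
    K.RandomGaussianBlur((5, 5), (0.1, 2.0), p=0.3),    # "strong": blur degradation
    K.RandomGaussianNoise(mean=0.0, std=0.03, p=0.3),   # "strong": sensor-like noise
    K.RandomMotionBlur(kernel_size=9, angle=35.0, direction=0.5, p=0.3),
    data_keys=["input", "mask"],  # geometric ops are mirrored onto the depth map;
)                                 # intensity ops touch only the RGB image

rgb = torch.rand(1, 3, 384, 384)    # dummy RGB batch in [0, 1]
depth = torch.rand(1, 1, 384, 384)  # dummy (inverse) depth map
aug_rgb, aug_depth = strong_aug(rgb, depth)
print(aug_rgb.shape, aug_depth.shape)
```

Declaring the pipeline with paired data keys keeps the RGB image and its depth map geometrically aligned, which is the main practical constraint when augmenting depth-supervised training data.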


Data Availability

The data that support our findings are all publicly available online:

  1. HRWSI (Xian et al., 2020): https://kexianhust.github.io/Structure-Guided-Ranking-Loss/
  2. 3DKenBurns (Niklaus et al., 2019): https://github.com/sniklaus/3d-ken-burns
  3. DrivingStereo (Yang et al., 2019): https://drivingstereo-dataset.github.io/
  4. MegaDepth (Li & Snavely, 2018): https://www.cs.cornell.edu/projects/megadepth/
  5. TartanAir (Wang et al., 2020): https://theairlab.org/tartanair-dataset/
  6. Taskonomy (Zamir et al., 2018): http://taskonomy.stanford.edu/
  7. Hypersim (Roberts et al., 2021): https://github.com/apple/ml-hypersim
  8. IRS (Wang et al., 2019): https://github.com/HKBU-HPML/IRS
  9. IBIMS (Koch et al., 2018): https://www.asg.ed.tum.de/lmf/ibims1/
  10. NYUDv2 (Silberman et al., 2012): https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
  11. KITTI (Uhrig et al., 2017): https://www.cvlibs.net/datasets/kitti/index.php
  12. ETH3D (Schöps et al., 2017): https://www.eth3d.net/datasets
  13. DIODE (Vasiljevic et al., 2019): https://diode-dataset.org/
  14. TUM (Sturm et al., 2012): https://cvg.cit.tum.de/data/datasets/rgbd-dataset
  15. OASIS (Chen et al., 2020): https://oasis.cs.princeton.edu/download
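The robustness benchmark described in the abstract applies 15 common corruptions to the clean test images from the datasets listed above. Below is a minimal sketch of how such a corrupted test set could be generated, assuming the third-party imagecorruptions package, which implements the Hendrycks and Dietterich (2019) corruptions; whether the authors used this package, and the severity level shown, are assumptions made purely for illustration.

```python
# Sketch: generate corrupted copies of a clean RGB test set.
# The `imagecorruptions` package and severity setting are illustrative assumptions.
import os
import numpy as np
from PIL import Image
from imagecorruptions import corrupt, get_corruption_names

def build_corrupted_set(clean_dir: str, out_dir: str, severity: int = 3) -> None:
    """Write one corrupted copy of every clean RGB image per corruption type."""
    for name in get_corruption_names():  # 15 corruption names, e.g. 'gaussian_noise'
        dst = os.path.join(out_dir, name, str(severity))
        os.makedirs(dst, exist_ok=True)
        for fname in sorted(os.listdir(clean_dir)):
            img = np.asarray(Image.open(os.path.join(clean_dir, fname)).convert("RGB"))
            corrupted = corrupt(img, corruption_name=name, severity=severity)
            Image.fromarray(corrupted).save(os.path.join(dst, fname))

# Example with hypothetical paths: build_corrupted_set("nyudv2/rgb", "nyudv2_corrupted")
```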

Notes

  1. https://kornia.readthedocs.io/en/latest/augmentation.module.html

References

  • Bian, J.-W., Zhan, H., Wang, N., Li, Z., Zhang, L., Shen, C., et al. (2021). Unsupervised scale-consistent depth learning from video. IJCV, 129(9), 2548–2564.

  • Chen, W., Fu, Z., Yang, D., & Deng, J. (2016). Single-image depth perception in the wild. In NeurIPS. (pp. 730–738).

  • Chen, W., Qian, S., & Deng, J. (2019). Learning single-image depth from videos using quality assessment networks. In CVPR. (pp. 5604–5613).

  • Chen, W., Qian, S., Fan, D., Kojima, N., Hamilton, M., & Deng, J. (2020). Oasis: A large-scale dataset for single image 3d in the wild. In CVPR. (pp. 679–688).

  • Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). Autoaugment: Learning augmentation strategies from data. In CVPR. (pp. 113–123).

  • DeVries, T., & Taylor, G. W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

  • Eigen, D., & Fergus, R. (2015). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV. (pp. 2650–2658).

  • Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, volume 27. (pp. 1–9).

  • Fu, H., Gong, M., Wang, C., Batmanghelich, K., & Tao, D. (2018). Deep ordinal regression network for monocular depth estimation. In CVPR. (pp. 2002–2011).

  • Godard, C., Mac Aodha, O., & Brostow, G. J. (2017). Unsupervised monocular depth estimation with left-right consistency. In CVPR. (pp. 270–279).

  • Godard, C., Mac Aodha, O., Firman, M., & Brostow, G. J. (2019). Digging into self-supervised monocular depth estimation. In CVPR. (pp. 3828–3838).

  • Hendrycks, D., & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.

  • Kamann, C., & Rother, C. (2021). Benchmarking the robustness of semantic segmentation models with respect to common corruptions. IJCV, 129, 462–483.

  • Kar, O. F., Yeo, T., Atanov, A., & Zamir, A. (2022). 3d common corruptions and data augmentation. In CVPR. (pp. 18963–18974).

  • Koch, T., Liebel, L., Fraundorfer, F., & Körner, M. (2018). Evaluation of CNN-based single-image depth estimation methods. In ECCVW. (pp. 331–348).

  • Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, N. (2016). Deeper depth prediction with fully convolutional residual networks. In Proceedings of the IEEE International Conference on 3D Vision. (pp. 239–248).

  • Lasinger, K., Ranftl, R., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI, 44(3), 1623–1637.

  • Lee, H., & Park, J. (2022). Instance-wise occlusion and depth orders in natural scenes. In CVPR. (pp. 21210–21221).

  • Lee, S., Rameau, F., Im, S., & Kweon, I. S. (2022). Self-supervised monocular depth and motion learning in dynamic scenes: Semantic prior to rescue. IJCV, 130(9), 2265–2285.

  • Li, Z., Niklaus, S., Snavely, N., & Wang, O. (2021). Neural scene flow fields for space-time view synthesis of dynamic scenes. In CVPR. (pp. 6498–6508).

  • Li, Z., & Snavely, N. (2018). Megadepth: Learning single-view depth prediction from internet photos. In CVPR. (pp. 2041–2050).

  • Niklaus, S., Mai, L., Yang, J., & Liu, F. (2019). 3D Ken Burns effect from a single image. ACM TOG, 38(6), 184:1–184:15.

  • Peng, J., Cao, Z., Luo, X., Lu, H., Xian, K., & Zhang, J. (2022). Bokehme: When neural rendering meets classical rendering. In CVPR. (pp. 16283–16292).

  • Ranftl, R., Bochkovskiy, A., & Koltun, V. (2021). Vision transformers for dense prediction. In ICCV. (pp. 12179–12188).

  • Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M. A., Paczan, N., et al. (2021). Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV. (pp. 10912–10922).

  • Saleh, B. E., & Teich, M. C. (2019). Fundamentals of photonics. London: Wiley.

  • Schöps, T., Schönberger, J. L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., et al. (2017). A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR. (pp. 3260–3269).

  • Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from rgbd images. In ECCV. (pp. 746–760).

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Sturm, J., Engelhard, N., Endres, F., Burgard, W., & Cremers, D. (2012). A benchmark for the evaluation of rgb-d slam systems. In IROS. (pp. 573–580).

  • Teed, Z., & Deng, J. (2020). Raft: Recurrent all-pairs field transforms for optical flow. In ECCV. (pp. 402–419).

  • Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., & Geiger, A. (2017). Sparsity invariant CNNs. In Proceedings of the IEEE International Conference on 3D Vision. (pp. 11–20).

  • Vasiljevic, I., Kolkin, N., Zhang, S., Luo, R., Wang, H., Dai, F. Z., et al. (2019). DIODE: A dense indoor and outdoor depth dataset. arXiv preprint arXiv:1908.00463.

  • Wadhwa, N., Garg, R., Jacobs, D. E., Feldman, B. E., Kanazawa, N., Carroll, R., et al. (2018). Synthetic depth-of-field with a single-camera mobile phone. ACM TOG, 37(4), 1–13.

  • Wang, L., Shen, X., Zhang, J., Wang, O., Lin, Z., Hsieh, C.-Y., et al. (2018). Deeplens: Shallow depth of field from a single image. ACM TOG, 37(6), 1–11.

  • Wang, Q., Li, Z., Salesin, D., Snavely, N., Curless, B., & Kontkanen, J. (2022). 3d moments from near-duplicate photos. In CVPR. (pp. 3906–3915).

  • Wang, Q., Zheng, S., Yan, Q., Deng, F., Zhao, K., & Chu, X. (2019). IRS: A large synthetic indoor robotics stereo dataset for disparity and surface normal estimation. arXiv preprint arXiv:1912.09678.

  • Wang, W., Zhu, D., Wang, X., Hu, Y., Qiu, Y., & Wang, C., et al. (2020). Tartanair: A dataset to push the limits of visual slam. In IROS. (pp. 4909–4916).

  • Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., & Li, R., et al. (2018). Monocular relative depth perception with web stereo data supervision. In CVPR. (pp. 311–320).

  • Xian, K., Zhang, J., Wang, O., Mai, L., Lin, Z., & Cao, Z. (2020). Structure-guided ranking loss for single image depth prediction. In CVPR. (pp. 611–620).

  • Xu, D., Ricci, E., Ouyang, W., Wang, X., & Sebe, N. (2017). Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In CVPR. (pp. 5354–5362).

  • Yang, G., Song, X., Huang, C., Deng, Z., Shi, J., & Zhou, B. (2019). Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios. In CVPR. (pp. 899–908).

  • Yin, W., Liu, Y., & Shen, C. (2022). Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction. IEEE TPAMI, 44(10), 7282–7295.

  • Yin, W., Zhang, J., Wang, O., Niklaus, S., Mai, L., & Chen, S., et al. (2021). Learning to recover 3d scene shape from a single image. In CVPR. (pp. 204–213).

  • Yoon, J. S., Kim, K., Gallo, O., Park, H. S., & Kautz, J. (2020). Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In CVPR. (pp. 5336–5345).

  • Yuan, J., Liu, Y., Shen, C., Wang, Z., & Li, H. (2021). A simple baseline for semi-supervised semantic segmentation with strong data augmentation. In ICCV. (pp. 8229–8238).

  • Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV. (pp. 6023–6032).

  • Zamir, A. R., Sax, A., Shen, W. B., Guibas, L. J., Malik, J., & Savarese, S. (2018). Taskonomy: Disentangling task transfer learning. In CVPR. (pp. 3712–3722).

  • Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

  • Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2020). Random erasing data augmentation. AAAI, 34(07), 13001–13008.

  • Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Unsupervised learning of depth and ego-motion from video. In CVPR. (pp. 1851–1858).

  • Zini, S., Buzzelli, M., Twardowski, B., & van de Weijer, J. (2022). Planckian jitter: Enhancing the color quality of self-supervised visual representations. arXiv preprint arXiv:2202.07993.

Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2022ZD0118700) and in part by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE-T2EP20220-0007). This work was also supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s). Z. Cao was supported by the National Natural Science Foundation of China (No. U1913602).

Author information

Corresponding author

Correspondence to Guosheng Lin.

Additional information

Communicated by D. Scharstein.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xian, K., Cao, Z., Shen, C. et al. Towards Robust Monocular Depth Estimation: A New Baseline and Benchmark. Int J Comput Vis 132, 2401–2419 (2024). https://doi.org/10.1007/s11263-023-01979-4
