Towards Robust Monocular Depth Estimation: A New Baseline and Benchmark

Published in: International Journal of Computer Vision

Abstract

Before deploying a monocular depth estimation (MDE) model in real-world applications such as autonomous driving, it is critical to understand its generalization and robustness. Although the generalization of MDE models has been thoroughly studied, their robustness has been overlooked in previous research. Existing state-of-the-art methods exhibit strong generalization to clean, unseen scenes. Such methods, however, degrade when the test image is perturbed. This is likely because prior art typically relies on basic 2D data augmentations (e.g., random horizontal flipping, random cropping, and color jittering), ignoring other common image degradations and corruptions. To mitigate this issue, we delve deeper into data augmentation and propose strong data augmentation techniques for robust depth estimation. In particular, we introduce a 3D-aware defocus blur in addition to seven 2D data augmentations. We evaluate the generalization of our model on six clean RGB-D datasets that were not seen during training. To evaluate the robustness of MDE models, we create a benchmark by applying 15 common corruptions to the clean images from IBIMS, NYUDv2, KITTI, ETH3D, DIODE, and TUM. On this benchmark, we systematically study the robustness of our method and 9 representative MDE models. The experimental results demonstrate that our model exhibits better generalization and robustness than previous methods. Specifically, we provide insights into the choice of data augmentation strategies and network architectures that should be useful for future research in robust monocular depth estimation. Our code, model, and benchmark are available at https://github.com/KexianHust/Robust-MonoDepth.
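As a concrete illustration of a strong 2D augmentation stack, the sketch below builds one with kornia, the augmentation library referenced in the paper's notes. It is a minimal sketch, assuming kornia as the backend; the specific transforms, magnitudes, and probabilities are illustrative assumptions rather than the authors' exact pipeline, and the 3D-aware defocus blur is omitted.

```python
# Minimal sketch of a "strong" 2D augmentation pipeline with kornia.
# The chosen transforms and parameters are assumptions for illustration only.
import torch
import kornia.augmentation as K

strong_aug = K.AugmentationSequential(
    K.RandomHorizontalFlip(p=0.5),                      # basic geometric augmentation
    K.ColorJitter(0.2, 0.2, 0.2, 0.1, p=0.8),           # basic photometric jitter
    K.RandomGaussianBlur((5, 5), (0.1, 2.0), p=0.3),    # "strong": blur degradation
    K.RandomGaussianNoise(mean=0.0, std=0.03, p=0.3),   # "strong": sensor-like noise
    K.RandomMotionBlur(kernel_size=9, angle=35.0, direction=0.5, p=0.3),
    data_keys=["input", "mask"],  # geometric ops are mirrored onto the depth map;
)                                 # intensity ops touch only the RGB image

rgb = torch.rand(1, 3, 384, 384)    # dummy RGB batch in [0, 1]
depth = torch.rand(1, 1, 384, 384)  # dummy (inverse) depth map
aug_rgb, aug_depth = strong_aug(rgb, depth)
print(aug_rgb.shape, aug_depth.shape)
```

Declaring the pipeline with paired data keys keeps the RGB image and its depth map geometrically aligned, which is the main practical constraint when augmenting depth-supervised training data.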


Data Availability

The data that support our findings are all publicly available online:

  1. HRWSI (Xian et al., 2020): https://kexianhust.github.io/Structure-Guided-Ranking-Loss/
  2. 3DKenBurns (Niklaus et al., 2019): https://github.com/sniklaus/3d-ken-burns
  3. DrivingStereo (Yang et al., 2019): https://drivingstereo-dataset.github.io/
  4. MegaDepth (Li & Snavely, 2018): https://www.cs.cornell.edu/projects/megadepth/
  5. TartanAir (Wang et al., 2020): https://theairlab.org/tartanair-dataset/
  6. Taskonomy (Zamir et al., 2018): http://taskonomy.stanford.edu/
  7. Hypersim (Roberts et al., 2021): https://github.com/apple/ml-hypersim
  8. IRS (Wang et al., 2019): https://github.com/HKBU-HPML/IRS
  9. IBIMS (Koch et al., 2018): https://www.asg.ed.tum.de/lmf/ibims1/
  10. NYUDv2 (Silberman et al., 2012): https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
  11. KITTI (Uhrig et al., 2017): https://www.cvlibs.net/datasets/kitti/index.php
  12. ETH3D (Schöps et al., 2017): https://www.eth3d.net/datasets
  13. DIODE (Vasiljevic et al., 2019): https://diode-dataset.org/
  14. TUM (Sturm et al., 2012): https://cvg.cit.tum.de/data/datasets/rgbd-dataset
  15. OASIS (Chen et al., 2020): https://oasis.cs.princeton.edu/download
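The robustness benchmark described in the abstract applies 15 common corruptions to the clean test images from the datasets listed above. Below is a minimal sketch of how such a corrupted test set could be generated, assuming the third-party imagecorruptions package, which implements the Hendrycks and Dietterich (2019) corruptions; whether the authors used this package, and the severity level shown, are assumptions made purely for illustration.

```python
# Sketch: generate corrupted copies of a clean RGB test set.
# The `imagecorruptions` package and severity setting are illustrative assumptions.
import os
import numpy as np
from PIL import Image
from imagecorruptions import corrupt, get_corruption_names

def build_corrupted_set(clean_dir: str, out_dir: str, severity: int = 3) -> None:
    """Write one corrupted copy of every clean RGB image per corruption type."""
    for name in get_corruption_names():  # 15 corruption names, e.g. 'gaussian_noise'
        dst = os.path.join(out_dir, name, str(severity))
        os.makedirs(dst, exist_ok=True)
        for fname in sorted(os.listdir(clean_dir)):
            img = np.asarray(Image.open(os.path.join(clean_dir, fname)).convert("RGB"))
            corrupted = corrupt(img, corruption_name=name, severity=severity)
            Image.fromarray(corrupted).save(os.path.join(dst, fname))

# Example with hypothetical paths: build_corrupted_set("nyudv2/rgb", "nyudv2_corrupted")
```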

Notes

  1. https://kornia.readthedocs.io/en/latest/augmentation.module.html

References

  • Bian, J.-W., Zhan, H., Wang, N., Li, Z., Zhang, L., Shen, C., et al. (2021). Unsupervised scale-consistent depth learning from video. IJCV, 129(9), 2548–2564.

  • Chen, W., Fu, Z., Yang, D., & Deng, J. (2016). Single-image depth perception in the wild. In NeurIPS. (pp. 730–738).

  • Chen, W., Qian, S., & Deng, J. (2019). Learning single-image depth from videos using quality assessment networks. In CVPR. (pp. 5604–5613).

  • Chen, W., Qian, S., Fan, D., Kojima, N., Hamilton, M., & Deng, J. (2020). Oasis: A large-scale dataset for single image 3d in the wild. In CVPR. (pp. 679–688).

  • Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). Autoaugment: Learning augmentation strategies from data. In CVPR. (pp. 113–123).

  • DeVries, T., & Taylor, G. W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

  • Eigen, D., & Fergus, R. (2015). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV. (pp. 2650–2658).

  • Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, volume 27. (pp. 1–9).

  • Fu, H., Gong, M., Wang, C., Batmanghelich, K., & Tao, D. (2018). Deep ordinal regression network for monocular depth estimation. In CVPR. (pp. 2002–2011).

  • Godard, C., Mac Aodha, O., & Brostow, G. J. (2017). Unsupervised monocular depth estimation with left-right consistency. In CVPR. (pp. 270–279).

  • Godard, C., Mac Aodha, O., Firman, M., & Brostow, G. J. (2019). Digging into self-supervised monocular depth estimation. In CVPR. (pp. 3828–3838).

  • Hendrycks, D., & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.

  • Kamann, C., & Rother, C. (2021). Benchmarking the robustness of semantic segmentation models with respect to common corruptions. IJCV, 129, 462–483.

  • Kar, O. F., Yeo, T., Atanov, A., & Zamir, A. (2022). 3d common corruptions and data augmentation. In CVPR. (pp. 18963–18974).

  • Koch, T., Liebel, L., Fraundorfer, F., & Körner, M. (2018). Evaluation of CNN-based single-image depth estimation methods. In ECCVW. (pp. 331–348).

  • Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, N. (2016). Deeper depth prediction with fully convolutional residual networks. In Proceedings of the IEEE International Conference on 3D Vision. (pp. 239–248).

  • Lasinger, K., Ranftl, R., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI, 44(3), 1623–1637.

  • Lee, H., & Park, J. (2022). Instance-wise occlusion and depth orders in natural scenes. In CVPR. (pp. 21210–21221).

  • Lee, S., Rameau, F., Im, S., & Kweon, I. S. (2022). Self-supervised monocular depth and motion learning in dynamic scenes: Semantic prior to rescue. IJCV, 130(9), 2265–2285.

  • Li, Z., Niklaus, S., Snavely, N., & Wang, O. (2021). Neural scene flow fields for space-time view synthesis of dynamic scenes. In CVPR. (pp. 6498–6508).

  • Li, Z., & Snavely, N. (2018). Megadepth: Learning single-view depth prediction from internet photos. In CVPR. (pp. 2041–2050).

  • Niklaus, S., Mai, L., Yang, J., & Liu, F. (2019). 3D Ken Burns effect from a single image. ACM TOG, 38(6), 184:1–184:15.

  • Peng, J., Cao, Z., Luo, X., Lu, H., Xian, K., & Zhang, J. (2022). Bokehme: When neural rendering meets classical rendering. In CVPR. (pp. 16283–16292).

  • Ranftl, R., Bochkovskiy, A., & Koltun, V. (2021). Vision transformers for dense prediction. In ICCV. (pp. 12179–12188).

  • Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M. A., Paczan, N., et al. (2021). Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV. (pp. 10912–10922).

  • Saleh, B. E., & Teich, M. C. (2019). Fundamentals of photonics. London: Wiley.

  • Schöps, T., Schönberger, J. L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., et al. (2017). A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR. (pp. 3260–3269).

  • Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from rgbd images. In ECCV. (pp. 746–760).

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Sturm, J., Engelhard, N., Endres, F., Burgard, W., & Cremers, D. (2012). A benchmark for the evaluation of rgb-d slam systems. In IROS. (pp. 573–580).

  • Teed, Z., & Deng, J. (2020). Raft: Recurrent all-pairs field transforms for optical flow. In ECCV. (pp. 402–419).

  • Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., & Geiger, A. (2017). Sparsity invariant CNNs. In Proceedings of the IEEE International Conference on 3D Vision. (pp. 11–20).

  • Vasiljevic, I., Kolkin, N., Zhang, S., Luo, R., Wang, H., Dai, F. Z., et al. (2019). DIODE: A dense indoor and outdoor depth dataset. arXiv preprint arXiv:1908.00463.

  • Wadhwa, N., Garg, R., Jacobs, D. E., Feldman, B. E., Kanazawa, N., Carroll, R., et al. (2018). Synthetic depth-of-field with a single-camera mobile phone. ACM TOG, 37(4), 1–13.

  • Wang, L., Shen, X., Zhang, J., Wang, O., Lin, Z., Hsieh, C.-Y., et al. (2018). Deeplens: Shallow depth of field from a single image. ACM TOG, 37(6), 1–11.

  • Wang, Q., Li, Z., Salesin, D., Snavely, N., Curless, B., & Kontkanen, J. (2022). 3d moments from near-duplicate photos. In CVPR. (pp. 3906–3915).

  • Wang, Q., Zheng, S., Yan, Q., Deng, F., Zhao, K., & Chu, X. (2019). IRS: A large synthetic indoor robotics stereo dataset for disparity and surface normal estimation. arXiv preprint arXiv:1912.09678.

  • Wang, W., Zhu, D., Wang, X., Hu, Y., Qiu, Y., & Wang, C., et al. (2020). Tartanair: A dataset to push the limits of visual slam. In IROS. (pp. 4909–4916).

  • Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., & Li, R., et al. (2018). Monocular relative depth perception with web stereo data supervision. In CVPR. (pp. 311–320).

  • Xian, K., Zhang, J., Wang, O., Mai, L., Lin, Z., & Cao, Z. (2020). Structure-guided ranking loss for single image depth prediction. In CVPR. (pp. 611–620).

  • Xu, D., Ricci, E., Ouyang, W., Wang, X., & Sebe, N. (2017). Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In CVPR. (pp. 5354–5362).

  • Yang, G., Song, X., Huang, C., Deng, Z., Shi, J., & Zhou, B. (2019). Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios. In CVPR. (pp. 899–908).

  • Yin, W., Liu, Y., & Shen, C. (2022). Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction. IEEE TPAMI, 44(10), 7282–7295.

  • Yin, W., Zhang, J., Wang, O., Niklaus, S., Mai, L., & Chen, S., et al. (2021). Learning to recover 3d scene shape from a single image. In CVPR. (pp. 204–213).

  • Yoon, J. S., Kim, K., Gallo, O., Park, H. S., & Kautz, J. (2020). Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In CVPR. (pp. 5336–5345).

  • Yuan, J., Liu, Y., Shen, C., Wang, Z., & Li, H. (2021). A simple baseline for semi-supervised semantic segmentation with strong data augmentation. In ICCV. (pp. 8229–8238).

  • Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV. (pp. 6023–6032).

  • Zamir, A. R., Sax, A., Shen, W. B., Guibas, L. J., Malik, J., & Savarese, S. (2018). Taskonomy: Disentangling task transfer learning. In CVPR. (pp. 3712–3722).

  • Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

  • Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2020). Random erasing data augmentation. AAAI, 34(07), 13001–13008.

  • Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Unsupervised learning of depth and ego-motion from video. In CVPR. (pp. 1851–1858).

  • Zini, S., Buzzelli, M., Twardowski, B., & van de Weijer, J. (2022). Planckian jitter: Enhancing the color quality of self-supervised visual representations. arXiv preprint arXiv:2202.07993.

Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2022ZD0118700) and in part by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE-T2EP20220-0007). This work was also supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s). Z. Cao was supported by the National Natural Science Foundation of China (No. U1913602).

Author information

Corresponding author

Correspondence to Guosheng Lin.

Additional information

Communicated by D. Scharstein.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xian, K., Cao, Z., Shen, C. et al. Towards Robust Monocular Depth Estimation: A New Baseline and Benchmark. Int J Comput Vis 132, 2401–2419 (2024). https://doi.org/10.1007/s11263-023-01979-4
