Attention-based context aggregation network for monocular depth estimation

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Depth estimation is a classic computer vision task that plays a crucial role in understanding 3D scene geometry. Recently, algorithms that combine multi-scale features extracted by dilated-convolution-based blocks (atrous spatial pyramid pooling, ASPP) have achieved significant improvements in depth estimation. However, the discretized, predefined dilation kernels cannot capture the continuous context information that differs across scenes, and they easily introduce grid artifacts. This paper proposes a novel algorithm, the attention-based context aggregation network (ACAN), for depth estimation. A supervised self-attention model is designed to adaptively learn task-specific similarities between pixels and thereby model continuous context information. Moreover, a soft ordinal inference is proposed to transform the predicted probabilities into continuous depth values, which reduces the discretization error (about a 1% decrease in RMSE). ACAN achieves state-of-the-art performance on public monocular depth estimation benchmark datasets. The source code of ACAN is available at https://github.com/miraiaroha/ACAN.
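The two ideas named in the abstract, pixel-wise self-attention for context aggregation and soft ordinal inference over discretized depth bins, can be illustrated with a short sketch. The snippet below is a hypothetical NumPy illustration of the general techniques under assumed conventions (log-space depth bins, dot-product attention); it is not taken from the linked repository, and the exact ACAN modules may differ.

```python
import numpy as np

def attention_context_aggregation(x, wq, wk, wv):
    """Generic self-attention over all pixel features: each pixel's context
    vector is a similarity-weighted sum of every pixel's value vector.
    x: (N, C) flattened pixel features; wq, wk, wv: (C, D) projections."""
    q, k, v = x @ wq, x @ wk, x @ wv                # (N, D) each
    logits = q @ k.T / np.sqrt(q.shape[-1])         # (N, N) pairwise similarities
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)        # row-wise softmax
    return attn @ v                                 # (N, D) aggregated context

def depth_bin_centers(d_min, d_max, num_bins):
    """Centers of depth bins spaced uniformly in log space (an assumed
    discretization scheme, common in ordinal depth regression)."""
    edges = np.exp(np.linspace(np.log(d_min), np.log(d_max), num_bins + 1))
    return 0.5 * (edges[:-1] + edges[1:])

def soft_ordinal_inference(probs, centers):
    """Convert per-pixel probabilities over depth bins into continuous depth
    by taking the probability-weighted sum of bin centers rather than the
    arg-max bin; this is what removes most of the discretization error."""
    # probs: (H, W, K) softmax outputs; centers: (K,)
    return np.tensordot(probs, centers, axes=([-1], [0]))  # (H, W)

# Toy usage. A pixel whose probability mass is split across two adjacent bins
# gets a depth between the two bin centers instead of snapping to one of them.
centers = depth_bin_centers(d_min=0.7, d_max=10.0, num_bins=80)
probs = np.zeros((1, 1, 80))
probs[0, 0, 40], probs[0, 0, 41] = 0.6, 0.4
print(soft_ordinal_inference(probs, centers))

# The attention module would act on flattened encoder features, e.g.:
feat = np.random.randn(16, 32)                      # a 4x4 feature map, flattened
w = [np.random.randn(32, 8) for _ in range(3)]      # hypothetical projections
ctx = attention_context_aggregation(feat, *w)       # (16, 8) context per pixel
```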

Author information

Corresponding author

Correspondence to Haitao Zhao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chen, Y., Zhao, H., Hu, Z. et al. Attention-based context aggregation network for monocular depth estimation. Int. J. Mach. Learn. & Cyber. 12, 1583–1596 (2021). https://doi.org/10.1007/s13042-020-01251-y
