Attention-based context aggregation network for monocular depth estimation

Chen, Yuru; Zhao, Haitao; Hu, Zhengwei; Peng, **gchao

doi:10.1007/s13042-020-01251-y

Attention-based context aggregation network for monocular depth estimation

Original Article
Published: 03 January 2021

Volume 12, pages 1583–1596, (2021)
Cite this article

International Journal of Machine Learning and Cybernetics Aims and scope Submit manuscript

Yuru Chen¹,
Haitao Zhao¹,
Zhengwei Hu¹ &
…
**gchao Peng¹

1142 Accesses
46 Citations
Explore all metrics

Abstract

Depth estimation is a traditional computer vision task, which plays a crucial role in understanding 3D scene geometry. Recently, algorithms that combine the multi-scale features extracted by the dilated convolution based block (atrous spatial pyramid pooling, ASPP) have gained significant improvements in depth estimation. However, the discretized and predefined dilation kernels cannot capture the continuous context information that differs in diverse scenes and easily introduce the grid artifacts. This paper proposes a novel algorithm, called attention-based context aggregation network (ACAN) for depth estimation. A supervised self-attention model is designed and utilized to adaptively learn the task-specific similarities between different pixels to model the continuous context information. Moreover, a soft ordinal inference is proposed to transform the predicted probabilities to continuous depth values which reduce the discretization error (about 1% decrease in RMSE). ACAN achieves state-of-the-art performance on public monocular depth-estimation benchmark datasets. The source code of ACAN can be found in https://github.com/miraiaroha/ACAN.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

EUR 32.99 /Month

Get 10 units per month
Download Article/Chapter or Ebook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Article Open access 13 September 2023

REAL-NET: A Monochromatic Depth Estimation Using REgional Attention and Local Feature Map**

Dual Attention Feature Fusion Network for Monocular Depth Estimation

References

Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from rgbd images 7576(1):746–760
Simon M, Milz S, Amende K, Gross HM (2018) Complex-yolo: real-time 3d object detection on point clouds
Tateno K, Tombari F, Laina I, Navab N (2017) Cnn-slam: real-time dense monocular slam with learned depth prediction. p 6565–6574
Laina I, Rupprecht C, Belagiannis V, Tombari F, Navab N (2016) Deeper depth prediction with fully convolutional residual networks. 3D Vision (3DV), 2016 fourth international conference on. p 239–248. IEEE
Ghosh S, Pal A, Jaiswal S, Santosh KC, Das N, Nasipuri M (2019) Segfast-v2: Semantic image segmentation with less parameters in deep learning for autonomous driving. Int J Mach Learn Cybern 10(11):3145–3154
Article Google Scholar
Hirschmüller H (2005) Accurate and efficient stereo processing by semi-global matching and mutual information. IEEE computer society conference on computer vision and pattern recognition. p 807–814
Roberts R, Sinha SN, Szeliski R, Steedly D (2011) Structure from motion for scenes with large duplicate structures. IEEE conference on computer vision and pattern recognition. p 3137–3144
Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. Int Conf Neural Inf Process Syst. 1:2366–2374
Google Scholar
Eigen D, Fergus R (2014) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. pp. 2650–2658
Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. International conference on medical image computing and computer-assisted intervention. p 234–241
Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. p 483–499
Wei SE, Ramakrishna V, Kanade T, Sheikh Y (2016) Convolutional pose machines. p 4724–4732
Huang J, Lee AB, Mumford D (2000) Statistics of range images. Comput Vis Pattern Recogn. Proceedings IEEE conference on. vol.1. p 324–331
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. ar**v preprint ar**v:1511.07122
LC Chen, G Papandreou, I Kokkinos, K Murphy, AL Yuille (2018) Deeplab Semantic image segmentation with deep convolutional nets atrous convolution and fully connected. IEEE Trans Pattern Anal Mach Intell 40(4): 834–848
Article Google Scholar
Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation
Wang P, Chen P, Yuan Y, Liu D, Huang Z, Hou X, Cottrell G (2018) Understanding convolution for semantic segmentation. IEEE winter conference on applications of computer vision. p 1451–1460
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need
Wang X, Girshick R, Gupta A, He K (2017) Non-local neural networks
Yuan Y, Wang J (2018) Ocnet: Object context network for scene parsing. ar**v preprint ar**v:1809.00916
Saxena A, Chung SH, Ng AY (2005) Learning depth from single monocular images. International conference on neural information processing systems. p 1161–1168
Saxena A, Sun M, Ng AY (2007) Learning 3-d scene structure from a single still image. IEEE international conference on computer vision. p 1–8
Liu B, Gould S, Koller D (2010) Single image depth estimation from predicted semantic labels. Comput Vis Pattern Recogn. p 1253–1260
Ladicky L, Shi J, Pollefeys M (2014) Pulling things out of perspective. IEEE Conf Comput Vis Pattern Recogn 9:89–96
Junjie H, Ozay M, Zhang Y, Okatani T (2018) Toward higher resolution maps with accurate object boundaries, revisiting single image depth estimation
Han Yan, Shunli Zhang, Yu Zhang, and Li Zhang. Monocular depth estimation with guidance of surface normal map. Neurocomputing, 280:86–100, 2018
Article Google Scholar
Junning Zhang, Qunxing Su, Pengyuan Liu, Chao Xu, and Yanlong Chen. Unsupervised learning of monocular depth and ego-motion with spacešctemporal-centroid loss. International Journal of Machine Learning and Cybernetics, 11(3), 615–627, 2020
Article Google Scholar
Roy A, Todorovic S (2016) Monocular depth estimation using neural regression forest. Comput Vis Pattern Recogn. p 5506–5514
Zwald L, Lambertlacroix S (2012) The berhu penalty and the grouped effect. Statistics
Garg R, Vijay Kumar BG, Carneiro G, Reid I (2016) Unsupervised cnn for single view depth estimation: Geometry to the rescue. European conference on computer vision. p 740Ã¢â‚¬â€œ756
Godard C, Aodha OM, Brostow GJ (2017) Unsupervised monocular depth estimation with left-right consistency. Comput Vis Pattern Recogn. 1:6602–6611
Google Scholar
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
Article Google Scholar
Heise P, Klose S, Jensen B, Knoll A (2014) Pm-huber: Patchmatch with huber regularization for stereo matching. IEEE international conference on computer vision. p 2360–2367
Saining **e and Zhuowen Tu. Holistically-nested edge detection. International Journal of Computer Vision, 125(1–3), 3–18, 2015
MathSciNet Google Scholar
Yu F, Koltun V, Funkhouser T (2017) Dilated residual networks. p 636–644
Kim Y, Jung H, Min D, Sohn K (2018) Deep monocular depth estimation via integration of global and local predictions. IEEE Trans Image Process Publ IEEE Sig Process Soc 99:1–1
Xu D, Ricci E, Ouyang W, Wang X, Sebe N (2017) Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. p 161–169
Liu F, Shen C, Lin G (2015) Deep convolutional neural fields for depth estimation from a single image. IEEE conference on computer vision and pattern recognition. p 5162–5170
Li B, Shen C, Dai Y, Van Den Hengel A, He M (2015) Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. Comput Vis Pattern Recogn. p 1119–1127
F. Liu, C. Shen, G. Lin, and I Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis & Machine Intelligence, 38(10), 2024–2039, 2015
Article Google Scholar
Zhang Z, Xu C, Yang J, Gao J, Cui Z (2018) Progressive hard-mining network for monocular depth estimation. IEEE Trans Image Process. 99:1–1
MathSciNet MATH Google Scholar
Li B, Dai Y, He M (2018) Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recogn
Moukari M, Picard S, Simon L, Jurie F (2018) Deep multi-scale architectures for monocular depth estimation. ar**v preprint ar**v:1806.03051
Fu H, Gong M, Wang C, Batmanghelich K, Tao D (2018) Deep ordinal regression network for monocular depth estimation. Proceedings of the IEEE conference on computer vision and pattern recognition. p. 2002–2011
Zheng S, Jayasumana S, Romera-Paredes B, Vineet V, Su Z, Du D, Huang C, Torr PHS (2015) Conditional random fields as recurrent neural networks. p 1529–1537
Lin G, Shen C, Reid I, Van Dan Hengel A (2015) Efficient piecewise training of deep structured models for semantic segmentation. p 3194–3203
Cao Y, Wu Zi, Shen C (2017) Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans Circ Syst Video Technol. 99:1–1
Google Scholar
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition. p. 770–778
Zia T, Abbas A, Habib U, Khan MS (2020) Learning deep hierarchical and temporal recurrent neural networks with residual learning. Int J Mach Learn Cybern 11(4):873–882
Article Google Scholar
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2015) Learning deep features for discriminative localization. p 2921–2929
Liu W, Rabinovich A, Berg AC (2015) Parsenet: Looking wider to see better. ar**v preprint ar**v:1506.04579
Li R, **an K, Shen C, Cao Z, Lu H, Hang L (2018) Deep attention-based classification network for robust depth prediction
Niu Z, Zhou M, Wang L, Gao X, Hua G (2016) Ordinal regression with multiple output cnn for age estimation. The IEEE conference on computer vision and pattern recognition (CVPR)
Geiger A (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. IEEE conference on computer vision and pattern recognition. p 3354–3361
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252, 2015
Article MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

School of Information Science and Engineering, East China University of Science and Technology, Shanghai, People’s Republic of China
Yuru Chen, Haitao Zhao, Zhengwei Hu & **gchao Peng

Authors

Yuru Chen
View author publications
You can also search for this author in PubMed Google Scholar
Haitao Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Zhengwei Hu
View author publications
You can also search for this author in PubMed Google Scholar
**gchao Peng
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Haitao Zhao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Chen, Y., Zhao, H., Hu, Z. et al. Attention-based context aggregation network for monocular depth estimation. Int. J. Mach. Learn. & Cyber. 12, 1583–1596 (2021). https://doi.org/10.1007/s13042-020-01251-y

Download citation

Received: 01 July 2020
Accepted: 24 November 2020
Published: 03 January 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s13042-020-01251-y

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

EUR 32.99 /Month

Get 10 units per month
Download Article/Chapter or Ebook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Attention-based context aggregation network for monocular depth estimation

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

REAL-NET: A Monochromatic Depth Estimation Using REgional Attention and Local Feature Map**

Dual Attention Feature Fusion Network for Monocular Depth Estimation

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Attention-based context aggregation network for monocular depth estimation

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

REAL-NET: A Monochromatic Depth Estimation Using REgional Attention and Local Feature Map**

Dual Attention Feature Fusion Network for Monocular Depth Estimation

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation