
Dual-attention-based semantic-aware self-supervised monocular depth estimation

Published in: Multimedia Tools and Applications

Abstract

Based on the assumption of photometric consistency, self-supervised monocular depth estimation has been widely studied because it avoids costly annotations. However, it is sensitive to noise, occlusions and photometric changes. To overcome these problems, we propose a multi-task model with a dual-attention-based cross-task feature fusion module (DCFFM). We simultaneously predict depth and semantics with a shared encoder and two separate decoders, aiming to improve depth estimation with the help of semantic supervision. In DCFFM, we fuse the cross-task features with both pixel-wise and channel-wise attention, so that each task fully exploits the helpful information offered by the other. Both attentions are computed in a one-to-all manner, which captures global information while limiting the growth of computation. Furthermore, we propose a novel data augmentation method called data exchange & recovery (DE&R), which performs inter-batch data exchange in both the vertical and horizontal directions to increase the diversity of the input data. This encourages the network to explore more diversified cues for depth estimation and avoids overfitting. Crucially, the corresponding outputs are then recovered in order to preserve the geometric relationships and ensure correct computation of the photometric loss. Extensive experiments on the KITTI dataset and the NYU-Depth-v2 dataset demonstrate that our method is highly effective and achieves better performance than other state-of-the-art works.
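To make the fusion idea concrete, the following is a minimal PyTorch sketch of a dual-attention cross-task fusion block in the spirit of DCFFM. All module and variable names are hypothetical, and the exact projections, normalisation and fusion used in the paper may differ; the sketch only illustrates re-weighting one task's features (e.g. depth) with a pixel-wise and a channel-wise attention derived from the other task (e.g. semantics), each computed in a one-to-all manner so that cost grows linearly with the number of positions and channels.

import torch
import torch.nn as nn

class DualAttentionFusion(nn.Module):
    # Hypothetical sketch: fuse features from the other task into the
    # target-task stream with pixel-wise and channel-wise attention,
    # each computed in a one-to-all manner.
    def __init__(self, channels):
        super().__init__()
        self.pixel_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.chan_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.out_proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, target_feat, other_feat):
        b, c, h, w = other_feat.shape

        # Pixel-wise attention: a single global query attends to all H*W
        # positions, so the cost is O(H*W) rather than O((H*W)^2).
        query = other_feat.mean(dim=(2, 3))                       # (B, C)
        keys = self.pixel_proj(other_feat).flatten(2)             # (B, C, H*W)
        pixel_attn = torch.softmax(
            torch.einsum("bc,bcn->bn", query, keys) / c ** 0.5, dim=-1
        ).view(b, 1, h, w)
        pixel_out = target_feat * (1.0 + pixel_attn)

        # Channel-wise attention: a single global descriptor re-weights all C channels.
        chan_desc = self.chan_proj(other_feat).mean(dim=(2, 3))   # (B, C)
        chan_attn = torch.sigmoid(chan_desc).view(b, c, 1, 1)
        chan_out = target_feat * (1.0 + chan_attn)

        # Fuse both attended variants back into the target-task stream.
        return self.out_proj(torch.cat([pixel_out, chan_out], dim=1))

# Usage (shapes assumed for illustration):
# fused = DualAttentionFusion(256)(depth_feat, sem_feat)  # both (B, 256, H, W)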
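Similarly, the exchange and recovery steps of DE&R can be pictured with the small sketch below. It assumes, purely for illustration, that each sample is exchanged with its neighbouring sample in the mini-batch and that the swapped region is the bottom half (vertical direction) or the right half (horizontal direction) of the image; the actual pairing scheme and split positions in the paper may differ. The key point is that applying the same index trick to the network outputs restores the original arrangement before the photometric loss is computed.

import torch

def exchange(images, direction="vertical"):
    # Hypothetical sketch: pair each sample with its neighbour in the batch
    # and swap the bottom (vertical) or right (horizontal) half between them.
    b, _, h, w = images.shape
    partner = torch.arange(b).roll(1)
    mixed = images.clone()
    if direction == "vertical":
        mixed[:, :, h // 2:, :] = images[partner, :, h // 2:, :]
    else:
        mixed[:, :, :, w // 2:] = images[partner, :, :, w // 2:]
    return mixed, partner

def recover(outputs, partner, direction="vertical"):
    # Undo the exchange on the predictions (e.g. predicted depth) so the
    # photometric loss is computed on geometrically consistent frames.
    _, _, h, w = outputs.shape
    restored = outputs.clone()
    if direction == "vertical":
        restored[partner, :, h // 2:, :] = outputs[:, :, h // 2:, :]
    else:
        restored[partner, :, :, w // 2:] = outputs[:, :, :, w // 2:]
    return restored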



Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported by the Department of Science and Technology of Guangdong Province (No. 2021B01420003).

Author information

Corresponding author

Correspondence to Feng Ye.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xu, J., Ye, F. & Lai, Y. Dual-attention-based semantic-aware self-supervised monocular depth estimation. Multimed Tools Appl 83, 65579–65601 (2024). https://doi.org/10.1007/s11042-023-17976-1


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-023-17976-1
