Learned Monocular Depth Priors in Visual-Inertial Initialization

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13682)

Abstract

Visual-inertial odometry (VIO) is the pose estimation backbone for most AR/VR and autonomous robotic systems today, in both academia and industry. However, these systems are highly sensitive to the initialization of key parameters such as sensor biases, gravity direction, and metric scale. In practical scenarios where high-parallax or variable-acceleration assumptions are rarely met (e.g., a hovering aerial robot, or a smartphone AR user not gesticulating with the phone), classical visual-inertial initialization formulations often become ill-conditioned and/or fail to converge meaningfully. In this paper we target visual-inertial initialization specifically for these low-excitation scenarios critical to in-the-wild usage. We propose to circumvent the limitations of classical visual-inertial structure-from-motion (SfM) initialization by incorporating a new learning-based measurement as a higher-level input. We leverage learned monocular depth images (mono-depth) to constrain the relative depth of features, and upgrade the mono-depths to metric scale by jointly optimizing for their scales and shifts. Our experiments show a significant improvement in problem conditioning compared to a classical formulation for visual-inertial initialization, and demonstrate significant accuracy and robustness improvements relative to the state of the art on public benchmarks, particularly under low-excitation scenarios. We further integrate our method into an existing odometry system to illustrate the impact of the improved initialization on the resulting tracking trajectories.

Acknowledgements

We thank Josh Hernandez and Maksym Dzitsiuk for their support in developing our real-time system implementation.

Author information

Correspondence to Yunwen Zhou.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 248 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhou, Y. et al. (2022). Learned Monocular Depth Priors in Visual-Inertial Initialization. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13682. Springer, Cham. https://doi.org/10.1007/978-3-031-20047-2_32

  • DOI: https://doi.org/10.1007/978-3-031-20047-2_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20046-5

  • Online ISBN: 978-3-031-20047-2

  • eBook Packages: Computer Science, Computer Science (R0)
