MMRCA: MultiModal Root Cause Analysis

  • Conference paper
Service-Oriented Computing – ICSOC 2021 Workshops (ICSOC 2021)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13236))

Abstract

Cloud systems are becoming increasingly complex and more difficult for human operators to manage due to the scale and interconnectedness of microservices. Increased observability and anomaly detection can alert operators when something has gone wrong; however, a fault can propagate throughout the cloud, leading to a large number of alerts. It is difficult for operators to manage this large number of alerts and to differentiate between the symptoms that have propagated from the fault and the actual root cause. In this paper, we present a MultiModal Root Cause Analysis algorithm called MMRCA. This algorithm leverages data from traces, topology, configurations, and metrics to accurately predict the root cause of the fault. Our approach consists of a three-step pipeline of topology reduction, metric causality, and metric reduction. The experimental results show that MMRCA can accurately detect the root cause in a number of different data sets, while maintaining an efficient use of resources and scaling to a large deployment.
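The abstract names the three pipeline stages but does not describe their internals, so the following is only a toy sketch of how such a pipeline might be wired together, not the authors' algorithm. Every name here (`reduce_topology`, `onset`, `rank_root_causes`, the `(caller, callee)` edge format, the fixed `threshold`) is hypothetical, and "earliest anomaly onset" is a deliberately simple stand-in for the metric causality step.

```python
def reduce_topology(edges, alerting):
    """Step 1 (assumed): keep only dependency edges between alerting services.

    Each edge is (caller, callee), meaning the caller depends on the callee.
    """
    return [(caller, callee) for caller, callee in edges
            if caller in alerting and callee in alerting]


def onset(series, threshold):
    """Index of the first sample deviating from the first sample by more
    than `threshold`, or None if the series never deviates."""
    baseline = series[0]
    for i, value in enumerate(series):
        if abs(value - baseline) > threshold:
            return i
    return None


def rank_root_causes(edges, alerting, metrics, threshold):
    """Return (service, metric) candidates, earliest anomaly first."""
    reduced = reduce_topology(edges, alerting)
    # A service that still calls another alerting service may only be
    # showing a propagated symptom, so it is dropped as a candidate.
    callers = {caller for caller, _ in reduced}
    candidates = [s for s in alerting if s not in callers]

    ranked = []
    for service in candidates:
        # Step 2 (stand-in for causality): find when each metric first deviates.
        onsets = {name: onset(series, threshold)
                  for name, series in metrics.get(service, {}).items()}
        onsets = {name: t for name, t in onsets.items() if t is not None}
        if onsets:
            # Step 3 (assumed): reduce to the single earliest-deviating metric.
            metric = min(onsets, key=onsets.get)
            ranked.append((onsets[metric], service, metric))
    ranked.sort()
    return [(service, metric) for _, service, metric in ranked]


# Made-up example data: frontend -> cart -> db, with search healthy.
edges = [("frontend", "cart"), ("cart", "db"), ("frontend", "search")]
alerting = {"frontend", "cart", "db"}
metrics = {
    "frontend": {"latency": [1, 1, 1, 1, 9]},
    "cart": {"latency": [1, 1, 1, 8, 8]},
    "db": {"cpu": [10, 10, 90, 90, 90], "latency": [1, 1, 1, 7, 7]},
}
result = rank_root_causes(edges, alerting, metrics, threshold=5)
print(result)
```

In this made-up example, `db` is the only alerting service with no alerting dependency of its own, and its `cpu` metric deviates before its `latency`, so the sketch reports `("db", "cpu")`. A real system would replace the onset heuristic with a proper causality or time-series similarity measure and would also draw on trace and configuration data.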

Author information

Correspondence to Gary White.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

White, G., Diuwe, J., Fonseca, E., O’Brien, O. (2022). MMRCA: MultiModal Root Cause Analysis. In: Hacid, H., et al. Service-Oriented Computing – ICSOC 2021 Workshops. ICSOC 2021. Lecture Notes in Computer Science, vol 13236. Springer, Cham. https://doi.org/10.1007/978-3-031-14135-5_14

  • DOI: https://doi.org/10.1007/978-3-031-14135-5_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-14134-8

  • Online ISBN: 978-3-031-14135-5

  • eBook Packages: Computer Science, Computer Science (R0)
