Abstract
Cloud systems are becoming increasingly complex and harder for human operators to manage because of the scale and interconnectedness of microservices. Increased observability and anomaly detection can alert operators when something has gone wrong; however, a fault can propagate throughout the cloud, leading to a large number of alerts. It is difficult for operators to manage this volume of alerts and to distinguish the symptoms that have propagated from the fault from the actual root cause. In this paper we present a MultiModal Root Cause Analysis algorithm called MMRCA. The algorithm leverages data from traces, topology, configurations and metrics to accurately predict the root cause of the fault. Our approach consists of a three-step pipeline of topology reduction, metric causality and metric reduction. The experimental results show that MMRCA can accurately detect the root cause on a number of different data sets, while using resources efficiently and scaling to a large deployment.
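To make the shape of the three-step pipeline concrete, the following is a minimal illustrative sketch and not the authors' implementation. Every function name, the dependency-graph representation, and the heuristics are assumptions introduced here: topology reduction is approximated by keeping alerting services that have no alerting dependency, metric causality is stood in for by a lagged Pearson correlation against a symptom metric, and metric reduction keeps the top-scoring metrics.

```python
from statistics import mean, pstdev


def reduce_topology(depends_on, alerting):
    """Step 1 (assumed heuristic): keep alerting services with no alerting dependency."""
    return sorted(
        s for s in alerting
        if not any(d in alerting for d in depends_on.get(s, []))
    )


def lagged_correlation(cause, effect, lag):
    """Pearson correlation between cause[t] and effect[t + lag]."""
    x = cause[:len(cause) - lag] if lag else cause
    y = effect[lag:]
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no signal
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def score_metrics(candidate_metrics, symptom, max_lag=3):
    """Step 2 (stand-in for causality): strongest lagged correlation with the symptom."""
    return {
        name: max(
            abs(lagged_correlation(series, symptom, lag))
            for lag in range(max_lag + 1)
        )
        for name, series in candidate_metrics.items()
    }


def reduce_metrics(scores, top_k=1):
    """Step 3: keep only the top_k highest-scoring metrics."""
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical three-service deployment in which every service is alerting.
depends_on = {"frontend": ["cart"], "cart": ["db"], "db": []}
alerting = {"frontend", "cart", "db"}
candidates = reduce_topology(depends_on, alerting)

# Hypothetical metrics: db.cpu rises one step before the frontend latency symptom.
symptom = [1, 1, 1, 1, 5, 5, 5, 5]
db_metrics = {
    "db.cpu": [1, 1, 1, 5, 5, 5, 5, 5],
    "db.disk": [2, 2, 2, 2, 2, 2, 2, 2],
}
scores = score_metrics(db_metrics, symptom)
print(candidates, reduce_metrics(scores))
```

In this toy scenario the topology step narrows three alerting services to the single candidate `db`, and the metric steps single out `db.cpu` because its change precedes the symptom. A real system would replace the correlation stand-in with a proper causality test and would also draw on traces and configurations.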
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
White, G., Diuwe, J., Fonseca, E., O’Brien, O. (2022). MMRCA: MultiModal Root Cause Analysis. In: Hacid, H., et al. Service-Oriented Computing – ICSOC 2021 Workshops. ICSOC 2021. Lecture Notes in Computer Science, vol 13236. Springer, Cham. https://doi.org/10.1007/978-3-031-14135-5_14
DOI: https://doi.org/10.1007/978-3-031-14135-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14134-8
Online ISBN: 978-3-031-14135-5
eBook Packages: Computer Science, Computer Science (R0)