Abstract
Root-cause analysis (RCA) is a crucial task in software system maintenance, where system logs play an essential role in capturing system behaviours and describing failures. Automatic RCA approaches are desirable, but they face the challenge that the knowledge model (KM) extracted from system logs can be faulty when the logs do not correctly represent some information. When such unrepresented information is required for successful RCA, it is called missing information (MI). Although much work has focused on automatically finding root causes of system failures from the given logs, automated RCA with MI remains under-explored. This paper proposes using the Abduction, Belief Revision and Conceptual Change (ABC) system to automate RCA after repairing the system’s KM so that it contains the MI. First, we show how ABC can be used to discover MI and repair the KM. Then we demonstrate how ABC automatically finds and repairs root causes. Based on automated reasoning, ABC considers the effect of changing a cause when repairing a system failure: the root cause is the one whose change leaves the fewest failures. Although ABC outputs multiple possible solutions for experts to choose from, it greatly reduces the manual work of discovering MI and analysing root causes, especially in large-scale system management, where any reduction in manual work is very beneficial. This is the first application of an automatic theory repair system to RCA tasks: the KM is not only used but also improved, because our approach can guide engineers to produce KMs and higher-quality logs that contain the spotted MI, thus improving the maintenance of complex software systems.
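The selection criterion stated in the abstract (the root cause is the candidate whose change leaves the fewest failures) can be illustrated with a toy sketch. This is not ABC's implementation: ABC reasons over Datalog-like theories by refutation and applies repair operations, whereas the sketch below assumes a propositional KM, uses simple forward chaining, and models "changing a cause" as deleting a fact. All atom names (`disk_full`, `alarm_db`, and so on) and function names are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (body atoms, head atom); all atoms here are hypothetical examples.
Rule = Tuple[FrozenSet[str], str]


def closure(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Forward-chain propositional Horn rules until no new atom is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
    return derived


def remaining_failures(facts: Set[str], rules: List[Rule],
                       failures: Set[str], removed: str) -> Set[str]:
    """Failures still derivable after deleting one candidate cause."""
    return closure(facts - {removed}, rules) & failures


def rank_causes(facts: Set[str], rules: List[Rule],
                failures: Set[str]) -> List[Tuple[str, int]]:
    """Order candidate causes by how many failures survive their removal."""
    scores = {f: len(remaining_failures(facts, rules, failures, f))
              for f in facts}
    return sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))


# Toy knowledge model (assumed for illustration only).
FACTS = {"disk_full", "cache_cold"}
RULES: List[Rule] = [
    (frozenset({"disk_full"}), "db_write_error"),
    (frozenset({"db_write_error"}), "alarm_db"),
    (frozenset({"db_write_error"}), "alarm_api"),
    (frozenset({"cache_cold"}), "alarm_latency"),
]
FAILURES = {"alarm_db", "alarm_api", "alarm_latency"}

if __name__ == "__main__":
    for cause, left in rank_causes(FACTS, RULES, FAILURES):
        print(f"{cause}: {left} failure(s) remain")
```

In this toy model, deleting `disk_full` leaves one failure derivable while deleting `cache_cold` leaves two, so `disk_full` ranks first. Like ABC, the sketch returns a ranked list rather than a single answer, leaving the final choice to an expert.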
Notes
1. The damage caused by MI in RCA is described in Fig. 2 and further discussed in the next section.
2. ABC’s code is available on GitHub: https://github.com/XuerLi/ABC_Datalog.
3. Keeping \(\implies \) is required by the inference of refutation.
4. For example, a triple represents an alarm about a system failure.
5. A cause may be missing while its logical consequence exists in a KM, e.g., only the latter is recorded in the log.
Acknowledgment
The authors would like to thank Huawei for supporting the research and providing data on which this paper was based under grant CIENG4721/LSC. We also gratefully acknowledge UKRI grant EP/V026607/1 and the support of ELIAI (The Edinburgh Laboratory for Integrated Artificial Intelligence) EPSRC (grant no EP/W002876/1). Thanks are also due to Zhenhao Zhou for the valuable discussions around network software systems. In addition, the anonymous reviewers gave us very useful feedback that improved the quality of this paper.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, X. et al. (2023). ABC in Root Cause Analysis: Discovering Missing Information and Repairing System Failures. In: Nicosia, G., et al. Machine Learning, Optimization, and Data Science. LOD 2022. Lecture Notes in Computer Science, vol 13810. Springer, Cham. https://doi.org/10.1007/978-3-031-25599-1_26
DOI: https://doi.org/10.1007/978-3-031-25599-1_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25598-4
Online ISBN: 978-3-031-25599-1
eBook Packages: Computer Science (R0)