LLM-Based Intelligent Fault Detection and Self-Healing Framework for Microservices
Vivek Arora, Social Discovery Group, Team Lead ML Engineer, Tbilisi, Georgia
Abstract
Microservice-based cloud applications have become the preferred architectural style for modern digital services due to their scalability, flexibility, and ease of deployment. However, the distributed nature of microservices introduces significant operational challenges, including service crashes, network failures, resource exhaustion, configuration anomalies, and dependency-related faults. These failures can propagate rapidly across interconnected services, resulting in performance degradation, service interruptions, and increased operational costs. Traditional fault management strategies depend mainly on predefined rules and manual intervention, which makes them poorly suited to dynamic cloud environments. To address these challenges, this study proposes an intelligent cloud fault management framework that uses Llama 3.1 for automatic fault detection, root cause analysis, and self-healing. The framework combines real-time monitoring with anomaly detection, contextual reasoning, and autonomous remediation to enhance operational resilience. Experimental evaluation demonstrates substantial improvements over conventional and semi-automated approaches, achieving 97.5% fault detection accuracy, 96.4% root cause identification accuracy, and a 95.7% self-healing success rate while reducing Mean Time to Recovery to 5.4 minutes. The findings demonstrate the capability of LLMs to support scalable, reliable, and autonomous cloud fault management.
Keywords
Microservices, Cloud Computing, Fault Management, Fault Detection, Root Cause Analysis, Self-Healing Systems, Large Language Models (LLMs).
Copyright License
Copyright (c) 2026 Vivek Arora

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which licenses unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.
