Towards Self-Healing Cloud Infrastructure: Automated Recovery Methods and Their Effectiveness
Oleksandr Shevchenko, Site Reliability Engineer, Jacksonville, Florida, USA
Abstract
This study analyzes existing strategies for automated recovery in self-healing cloud infrastructures, drawing on a review of findings from previous scientific publications. The analysis shows that intelligent remediation methods can both reduce downtime and improve the economic resilience of cloud infrastructure, paving the way toward fully autonomous, self-healing digital platforms. The scientific contribution of this work is the first comparative evaluation of the effectiveness of rule-based approaches, ML-prioritized methods, genetic algorithms, and deep Q-network (DQN) agents in multi-cloud Kubernetes environments. Its practical significance lies in the proposed hybrid pipeline with a DQN-based scheduler, which achieves a reduction in downtime of more than 70% and balances recovery speed against cost-efficiency in real-world cloud platforms. These insights will be particularly valuable to researchers in autonomous distributed systems and cloud infrastructure reliability, especially those developing and formally verifying self-healing and automated failure-correction mechanisms. The effectiveness analysis is also of practical relevance to DevOps/PlatformOps architects and SRE specialists seeking to improve the availability and resilience of critical services by integrating advanced automated recovery algorithms.
Keywords
self-healing infrastructure, automated remediation, multi-cloud, anomaly, reinforcement learning, DevOps, genetic algorithm, AIOps, MTTR, Kubernetes.
Copyright License
Copyright (c) 2025 Oleksandr Shevchenko

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which licenses unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.