Diagnostic Insight–Based Adaptive Maintenance of Heterogeneous Computing Infrastructures via Advanced Language Reasoning and Orchestrated Containers

Dr. Neema K. Mwakalinga

Authors

Dr. Neema K. Mwakalinga, Faculty of Distributed Computing and AI Engineering, East African Digital Innovation University, Dar es Salaam, Tanzania

Keywords:

Heterogeneous computing, adaptive maintenance, Kubernetes, large language models

Abstract

Modern distributed computing ecosystems have evolved into highly heterogeneous, multi-layered infrastructures spanning cloud, edge, fog, and IoT environments. While this computing continuum enhances scalability and responsiveness, it introduces significant operational complexity in terms of scheduling, resource allocation, fault management, and maintenance. Traditional static and reactive maintenance strategies are increasingly inadequate for such dynamic environments, particularly under variable workloads and service-level objectives (SLOs). This research proposes a diagnostic insight–based adaptive maintenance framework that leverages advanced language reasoning models integrated with orchestrated container systems to enable proactive, self-healing, and context-aware infrastructure management.

The study synthesizes key principles from distributed scheduling systems (Ousterhout et al., 2013), serverless computing fabrics (Nastic et al., 2022), and continuum-aware architectures (Dustdar et al., 2023), extending them with intelligent diagnostic reasoning mechanisms inspired by post-mortem system intelligence frameworks (Post-Mortem Intelligence for Self-Healing Multi-Cloud Enterprise Applications Using LLMs and Kubernetes, 2026). The proposed approach introduces a layered architecture combining telemetry-driven diagnostics, large language model (LLM)-based reasoning engines, and Kubernetes-based orchestration to continuously interpret system state, predict anomalies, and recommend or execute corrective actions.

Unlike conventional monitoring systems, which primarily detect failures, the proposed framework emphasizes semantic interpretation of system behavior, enabling deeper root-cause analysis and adaptive decision-making. This is particularly relevant in heterogeneous environments where resource variability, hardware differences (e.g., NVIDIA Tesla GPUs), and distributed workload scheduling constraints create non-linear system behaviors. The framework also integrates insights from energy-aware scheduling (Wang et al., 2022) and graph-based scheduling intelligence (Zhao et al., 2021) to optimize performance-efficiency trade-offs.

Preliminary analysis suggests that integrating diagnostic language models with orchestration layers can shorten fault recovery time and improve both scheduling efficiency and the stability of resource utilization. However, challenges remain in ensuring interpretability, reducing inference overhead, and maintaining reliability under high system entropy. The research concludes that diagnostic insight–driven adaptive maintenance represents a promising direction for next-generation autonomous computing infrastructures.
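
As an illustration of the layered flow described above (telemetry-driven diagnostics, a reasoning engine, and an orchestration action), the following minimal Python sketch may be helpful. It is not the framework's implementation: the names Telemetry, Diagnosis, Action, detect, stub_reasoner, and maintenance_cycle are hypothetical, the z-score threshold of 3.0 is an assumed value, and a simple rule-based function stands in for the LLM-based reasoning engine. The returned actions are plain descriptors; applying them to a cluster through the Kubernetes API is deliberately left out.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Optional


@dataclass
class Telemetry:
    node: str             # node name (hypothetical)
    metric: str           # metric name (hypothetical)
    history: List[float]  # recent baseline samples
    current: float        # latest observed value


@dataclass
class Diagnosis:
    node: str
    metric: str
    z_score: float


@dataclass
class Action:
    kind: str    # e.g. "cordon_and_reschedule" or "observe"
    target: str
    reason: str


def detect(sample: Telemetry, threshold: float = 3.0) -> Optional[Diagnosis]:
    """Diagnostic layer: flag a sample whose z-score exceeds the threshold."""
    sigma = pstdev(sample.history)
    if sigma == 0:
        return None  # flat baseline, no z-score can be computed
    z = (sample.current - mean(sample.history)) / sigma
    if abs(z) >= threshold:
        return Diagnosis(sample.node, sample.metric, z)
    return None


def stub_reasoner(d: Diagnosis) -> Action:
    """Stand-in for the LLM reasoning engine (assumption: rule-based)."""
    if d.metric == "gpu_temp_c" and d.z_score > 0:
        return Action("cordon_and_reschedule", d.node,
                      f"{d.metric} is {d.z_score:.1f} sigma above baseline")
    return Action("observe", d.node,
                  f"{d.metric} deviates by {d.z_score:.1f} sigma")


def maintenance_cycle(
    samples: List[Telemetry],
    reasoner: Callable[[Diagnosis], Action] = stub_reasoner,
) -> List[Action]:
    """One pass of the loop: diagnose each sample, then reason about anomalies."""
    actions = []
    for sample in samples:
        diagnosis = detect(sample)
        if diagnosis is not None:
            actions.append(reasoner(diagnosis))
    return actions


if __name__ == "__main__":
    samples = [
        Telemetry("edge-gpu-1", "gpu_temp_c", [60, 61, 59, 60, 62], 95.0),
        Telemetry("cloud-cpu-7", "cpu_util", [0.50, 0.52, 0.48, 0.51, 0.49], 0.50),
    ]
    for action in maintenance_cycle(samples):
        print(action)
```

Passing the reasoner as a parameter keeps the diagnostic and orchestration layers independent of any particular language model, so the stub could be replaced by a model-backed function without changing the rest of the loop.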

References

A. Attanasio, G. Ghiani, L. Grandinetti, and F. Guerriero, “Auction algorithms for decentralized parallel machine scheduling,” Parallel Computing, vol. 32, pp. 701–709, Oct. 2006.

P. Beckman, J. Dongarra, N. Ferrier, G. Fox, T. Moore, D. Reed, and M. Beck, “Harnessing the computing continuum for programming our world,” in Fog Computing (A. Zomaya, A. Abbas, and S. Khan, eds.), pp. 215–230, John Wiley & Sons, Ltd, Apr. 2020.

D. Bermbach, J. Bader, J. Hasenburg, T. Pfandzelter, and L. Thamsen, “AuctionWhisk: Using an auction-inspired approach for function placement in serverless fog platforms,” Software: Practice and Experience, vol. 52, no. 5, pp. 1143–1169, 2022.

C. Carrión, “Kubernetes Scheduling: Taxonomy, Ongoing Issues and Challenges,” ACM Comput. Surv., vol. 55, pp. 138:1–138:37, Dec. 2022.

V. Casamayor Pujol, A. Morichetta, I. Murturi, P. Kumar Donta, and S. Dustdar, “Fundamental Research Challenges for Distributed Computing Continuum Systems,” Information, vol. 14, Mar. 2023.

C. Delimitrou, D. Sanchez, and C. Kozyrakis, “Tarcil: Reconciling scheduling speed and quality in large shared clusters,” ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing, pp. 97–110, Aug. 2015.

S. Dustdar, V. C. Pujol, and P. K. Donta, “On Distributed Computing Continuum Systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, pp. 4092–4105, Apr. 2023.

S. K. Saurav and S. Benedict, “A Taxonomy and Survey on Energy-Aware Scientific Workflows Scheduling in Large-Scale Heterogeneous Architecture,” in 2021 6th International Conference on Inventive Computation Technologies (ICICT), pp. 820–826, Jan. 2021.

J. Schleier-Smith, V. Sreekanti, A. Khandelwal, J. Carreira, N. J. Yadwadkar, R. A. Popa, J. E. Gonzalez, I. Stoica, and D. A. Patterson, “What serverless computing is and should become,” Communications of the ACM, vol. 64, pp. 76–84, May 2021.

S. K. Shukla, D. Ghosal, and M. K. Farrens, “Understanding and Leveraging Cluster Heterogeneity for Efficient Execution of Cloud Services,” in 2021 IEEE 10th International Conference on Cloud Networking (CloudNet), pp. 56–64, Nov. 2021.

S. Nastic, P. Raith, A. Furutanpey, T. Pusztai, and S. Dustdar, “A Serverless Computing Fabric for Edge & Cloud,” in 2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI), pp. 1–12, Dec. 2022.

S. Nastic, T. Pusztai, A. Morichetta, V. C. Pujol, S. Dustdar, D. Vij, and Y. Xiong, “Polaris scheduler: Edge sensitive and SLO aware workload scheduling in cloud-edge-IoT clusters,” in 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), pp. 206–216, IEEE, 2021.

S. Nastic, A. Morichetta, T. Pusztai, S. Dustdar, X. Ding, D. Vij, and Y. Xiong, “SLOC: Service level objectives for next generation cloud computing,” IEEE Internet Computing, vol. 24, no. 3, pp. 39–50, 2020.

S. Nastic, A. Morichetta, T. Pusztai, V. Casamayor Pujol, S. Dustdar, X. Ding, D. Vij, and Y. Xiong, “A novel middleware for efficiently implementing complex cloud-native SLOs,” in IEEE 14th International Conference on Cloud Computing (CLOUD), 2021.

S. Nastic, A. Morichetta, T. Pusztai, V. Casamayor Pujol, S. Dustdar, X. Ding, D. Vij, and Y. Xiong, “SLO script: A novel language for implementing complex cloud-native elasticity-driven SLOs,” in IEEE International Conference on Web Services (ICWS), 2021.

NVIDIA Tesla T4 Tensor Core GPUs for Accelerating Inference. https://www.nvidia.com/en-us/data-center/tesla-t4/ (accessed 2023-02-15).

NVIDIA Tesla K80. https://www.nvidia.com/en-gb/data-center/tesla-k80/ (accessed 2023-02-14).

NVIDIA Tesla P100: der fortschrittlichste Grafikprozessor für Rechenzentren [NVIDIA Tesla P100: the most advanced GPU for data centers]. https://www.nvidia.com/de-de/data-center/tesla-p100/ (accessed 2023-02-15).

NVIDIA Tesla V100. https://www.nvidia.com/en-gb/data-center/tesla-v100/ (accessed 2023-02-15).

K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica, “Sparrow: distributed, low latency scheduling,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13 (New York, NY, USA), pp. 69–84, Association for Computing Machinery, Nov. 2013.

A. K. Kulkarni and B. Annappa, “Context Aware VM Placement Optimization Technique for Heterogeneous IaaS Cloud,” IEEE Access, vol. 7, pp. 89702–89713, 2019.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

Q. Luo, S. Hu, C. Li, G. Li, and W. Shi, “Resource Scheduling in Edge Computing: A Survey,” IEEE Communications Surveys & Tutorials, vol. 23, no. 4, pp. 2131–2165, 2021.

M. Raeisi-Varzaneh, O. Dakkak, A. Habbal, and B.-S. Kim, “Resource Scheduling in Edge Computing: Architecture, Taxonomy, Open Issues and Future Research Directions,” IEEE Access, vol. 11, pp. 25329–25350, 2023.

A. Morichetta, V. Casamayor Pujol, S. Nastic, S. Dustdar, D. Vij, Y. Xiong, and Z. Zhang, “PolarisProfiler: A novel metadata-based profiling approach for optimizing resource management in the edge-cloud continuum,” in 2023 18th Annual System of Systems Engineering Conference (SOSE), 2023. Accepted - To be published.

T. Pusztai, S. Nastic, A. Morichetta, V. Casamayor Pujol, S. Dustdar, X. Ding, D. Vij, and Y. Xiong, “A novel middleware for efficiently implementing complex cloud-native SLOs,” in IEEE 14th International Conference on Cloud Computing (CLOUD), 2021.

T. Pusztai, S. Nastic, A. Morichetta, V. Casamayor Pujol, S. Dustdar, X. Ding, D. Vij, and Y. Xiong, “SLO script: A novel language for implementing complex cloud-native elasticity-driven SLOs,” in IEEE International Conference on Web Services (ICWS), 2021.

S. Pallewatta, V. Kostakos, and R. Buyya, “Microservices-based IoT Applications Scheduling in Edge and Fog Computing: A Taxonomy and Future Directions,” July 2022. arXiv:2207.05399 [cs].

Q. Wang, X. Mei, H. Liu, Y.-W. Leung, Z. Li, and X. Chu, “Energy-Aware Non-Preemptive Task Scheduling With Deadline Constraint in DVFS-Enabled Heterogeneous Clusters,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, pp. 4083–4099, Dec. 2022.

Z. Zhao, G. Verma, C. Rao, A. Swami, and S. Segarra, “Distributed scheduling using graph neural networks,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2021-June, pp. 4720–4724, 2021.

Z. Zhong and R. Buyya, “A Cost-Efficient Container Orchestration Strategy in Kubernetes-Based Cloud Computing Infrastructures with Heterogeneous Resources,” ACM Trans. Internet Technol., vol. 20, pp. 15:1–15:24, Apr. 2020.

Post-Mortem Intelligence for Self-Healing Multi-Cloud Enterprise Applications Using LLMs and Kubernetes. (2026). International Journal of Research and Applied Innovations, 9(1), 13641–13649. https://doi.org/10.15662/IJRAI.2026.0901017.