Proxy-Based Thermal and Acoustic Evaluation of Cloud GPUs for AI Training Workloads
Karan Lulla, Senior Board Test Engineer, NVIDIA, CA, USA
Reena Chandra, Tools and Automation Engineer, Amazon, CA, USA
Karthik Sirigiri, Software Developer, Redmane Technology, IL, USA
Abstract
The use of cloud-based Graphics Processing Units (GPUs) to train and deploy deep learning models has grown rapidly, and with it the need to understand their thermal and acoustic behavior under real-world workloads. Typical cloud platforms do not expose direct telemetry such as temperature, fan speed, or acoustic emissions. To overcome this limitation, this study estimates the thermal and acoustic output of GPU workloads with a proxy-based model derived from available metrics: GPU utilization, memory provisioning, power consumption, and empirical Thermal Design Power (TDP) values. The study compares two representative AI tasks, BERT for natural language processing and YOLOv5 for real-time object detection, on NVIDIA GPUs available through Colab (T4, V100, P100). Runtime logs were gathered with nvidia-smi, and GPU specifications were obtained from public Kaggle datasets. Proxy statistics, including TDP-per-MHz and thermal load (Power * Duration), were calculated to model workload-induced heat dissipation. To estimate acoustic impact, a TDP threshold was applied to approximate the level of fan-driven noise. Visual analytics (boxplots, scatterplots, and bubble plots) revealed considerable differences in GPU stress patterns: BERT jobs produced a very high cumulative thermal load with moderate acoustic impact, whereas YOLOv5 showed a bursty power footprint and a substantial acoustic imprint on high-TDP GPUs. The findings suggest that proxy estimation is a reproducible, interpretable, and lightweight substitute for direct measurement of GPU thermal and acoustic behavior in cloud settings. The approach supports thermal-aware scheduling, infrastructure optimization, and deployment of AI models with reduced energy consumption in multi-tenant GPU environments.
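The proxy statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the column order of the nvidia-smi log, the fixed 1-second sampling interval, the board figures in `GPU_SPECS`, and the acoustic thresholds (`low_w`, `high_w`) are all assumptions chosen for illustration and should be replaced with values from the actual logs and specification dataset.

```python
import csv
import io

# Assumed nominal board figures (TDP in watts, base clock in MHz).
# Illustrative only; verify against the specification dataset in use.
GPU_SPECS = {
    "T4": {"tdp_w": 70.0, "base_clock_mhz": 585.0},
    "P100": {"tdp_w": 250.0, "base_clock_mhz": 1190.0},
    "V100": {"tdp_w": 250.0, "base_clock_mhz": 1245.0},
}


def read_power_column(csv_text):
    """Extract power draw (W) from a headerless nvidia-smi CSV log.

    Assumes rows of the form: utilization.gpu, memory.used, power.draw,
    e.g. produced with --format=csv,noheader,nounits.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return [float(row[2]) for row in reader if row]


def tdp_per_mhz(gpu):
    """TDP-per-MHz proxy: rated TDP divided by base clock."""
    spec = GPU_SPECS[gpu]
    return spec["tdp_w"] / spec["base_clock_mhz"]


def thermal_load_joules(power_samples_w, interval_s=1.0):
    """Thermal load proxy: mean power times run duration (Power * Duration)."""
    if not power_samples_w:
        return 0.0
    duration_s = len(power_samples_w) * interval_s
    mean_power_w = sum(power_samples_w) / len(power_samples_w)
    return mean_power_w * duration_s


def acoustic_class(gpu, low_w=100.0, high_w=250.0):
    """Bucket a GPU by TDP as a stand-in for fan-driven noise.

    The thresholds are hypothetical, not values reported by the study.
    """
    tdp = GPU_SPECS[gpu]["tdp_w"]
    if tdp < low_w:
        return "low"
    if tdp < high_w:
        return "medium"
    return "high"


if __name__ == "__main__":
    log = "35, 1024, 45.20\n40, 2048, 55.80\n"
    samples = read_power_column(log)
    print(thermal_load_joules(samples), tdp_per_mhz("T4"), acoustic_class("T4"))
```

A log of this shape can be collected with `nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw --format=csv,noheader,nounits -l 1`. Because every input is either a logged metric or a published specification, the same calculation can be repeated on any cloud instance that exposes nvidia-smi.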
Keywords
Cloud GPUs, Thermal Load Estimation, Acoustic Classification, Proxy Metrics
Copyright License
Copyright (c) 2025 Karan Lulla, Reena Chandra, Karthik Sirigiri

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which licenses unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.