Energy-Efficient GPU Allocation and Frequency Management in Exascale Computing Systems

Wednesday, June 11, 2025 9:25 AM to 9:50 AM · 25 min. (Europe/Berlin)
Hall F - 2nd floor
Research Paper
Energy Management · Heterogeneous System Architectures · Resource Management and Scheduling · Runtime Systems for HPC · Sustainability and Energy Efficiency

Information

Heterogeneous CPU-GPU architectures dominate high-performance computing (HPC), but their growing power demands pose significant challenges at exascale. Dynamic voltage and frequency scaling (DVFS) is the most efficient technique for managing power consumption, but its effectiveness is limited by hardware variability and aging effects, which introduce power and performance heterogeneity across nominally identical GPUs. To address this, we propose V-FORGE, a variability- and frequency-aware optimization framework for improving GPU energy efficiency in HPC environments. V-FORGE integrates power-performance variability profiling, a Random Forest classifier for GPU frequency prediction, and dynamic optimization to select the most suitable GPU node and frequency for each application. In experiments on 400 AMD MI250X GPUs across fifteen HPC applications, V-FORGE improves combined performance and energy efficiency, measured by the energy-delay product (EDP), by up to 41% compared to default execution of HPC applications, and over 73% of its solutions fall within the top 1% of the optimal configurations found via exhaustive search. V-FORGE also adapts to dynamic environments by efficiently handling scenarios in which the ideal GPU node is unavailable.
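As a rough illustration of the selection criterion described in the abstract, the sketch below computes the energy-delay product (energy multiplied by runtime) for a few profiled GPU and frequency combinations and picks the one with the lowest EDP among the GPUs that are currently available. It is not V-FORGE's implementation: the `Profile` class, the `best_config` function, the GPU names, and all numbers are hypothetical, and a plain minimum over profiled data stands in for the Random Forest prediction and dynamic optimization the abstract mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """One profiled run of an application (hypothetical record layout)."""
    gpu_id: str
    freq_mhz: int
    runtime_s: float
    avg_power_w: float

    @property
    def energy_j(self) -> float:
        # Energy = average power x runtime.
        return self.avg_power_w * self.runtime_s

    @property
    def edp(self) -> float:
        # Energy-delay product = energy x runtime; lower is better.
        return self.energy_j * self.runtime_s


def best_config(profiles, available):
    """Return the lowest-EDP profile whose GPU is in the available set."""
    candidates = [p for p in profiles if p.gpu_id in available]
    if not candidates:
        raise ValueError("no profiled GPU is currently available")
    return min(candidates, key=lambda p: p.edp)


# Made-up measurements for two nominally identical GPUs at two frequencies.
profiles = [
    Profile("gpu-a", 1700, 10.0, 500.0),
    Profile("gpu-a", 1300, 11.0, 380.0),
    Profile("gpu-b", 1700, 10.5, 520.0),
    Profile("gpu-b", 1300, 11.5, 400.0),
]

# Every GPU is free: the overall best pair is chosen.
ideal = best_config(profiles, {"gpu-a", "gpu-b"})
# The ideal GPU is busy: fall back to the best remaining pair.
fallback = best_config(profiles, {"gpu-b"})

print(ideal.gpu_id, ideal.freq_mhz, ideal.edp)
print(fallback.gpu_id, fallback.freq_mhz, fallback.edp)
```

In this toy data the two GPUs draw different power at the same frequency, which mirrors the variability the abstract describes: the best choice depends on which physical GPU is used, not only on the frequency.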
Format: On Demand, On Site