As High-Performance Computing (HPC) systems grow in complexity and scale, they generate vast volumes of diverse data, including hardware telemetry, performance counters, system logs, and troubleshooting tickets. Efficiently analyzing these datasets is critical for optimizing system performance, ensuring resilience, and advancing system design. Machine Learning (ML) and Artificial Intelligence (AI) techniques have demonstrated transformative potential in extracting actionable insights from such large-scale datasets.
Building on the success of our previous editions (first launched at CUG 2024, followed by ISC-HPC 2024 and SC 2024), this fourth BoF will highlight cutting-edge AI and ML applications in HPC workload analysis. At ISC-HPC 2024, we engaged over 100 participants, fostering in-depth discussions and receiving valuable feedback, which we addressed at subsequent events.
This year, we will spotlight research advancements at NERSC and other national labs, including NREL and the Barcelona Supercomputing Center (BSC), alongside industry insights, with invited speakers such as Carlee Joe-Wong (Carnegie Mellon University), Doug Jacobson (Microsoft), and Thorsten Kurth (NVIDIA). Our interactive format will feature lightning talks on four key topics: data collection infrastructure, system monitoring for optimal operations, analysis techniques, and application-focused anomaly detection. Guided discussions will follow to encourage knowledge exchange and collaboration.
This BoF aims to bridge gaps between academic, government, and industry researchers, fostering interdisciplinary conversations to tackle shared challenges in managing and optimizing the next generation of HPC systems.
Organizers: