

Two Worlds Collide: Trustworthiness and Sustainability for Coupled HPC and AI Simulation
Thursday, June 12, 2025 11:30 AM to 12:30 PM · 1 hr. (Europe/Berlin)
Hall E - 2nd floor
Birds of a Feather
AI Applications powered by HPC Technologies · Application Workflows for Discovery · HPC Simulations enhanced by Machine Learning · Optimizing for Energy and Performance · Performance Measurement
Information
The "Two Worlds Collide" Birds of a Feather (BoF) series focuses on the experiences, challenges, and opportunities that laboratories and vendors face when integrating deep learning (DL) and artificial intelligence (AI) with high-performance computing (HPC) for advanced simulation research. This fourth installment, titled "Trustworthiness and Sustainability for Converged HPC and AI Simulation," aims to promote a trustworthy and assured integration between established HPC simulation and the rapidly evolving DL ecosystem. It also seeks to address the emerging sustainability concerns associated with the verification and validation of converged HPC and AI simulations.
The convergence of HPC and DL has unlocked exciting new possibilities in simulation, spanning from molecular-scale applications to climate modeling and fluid dynamics. This convergence has created the need for a new programming paradigm and environment that seamlessly integrates simulation applications with DL frameworks, through techniques like in-memory coupling and inference serving.
Incorporating industry-developed DL frameworks into the HPC programming environment introduces several challenges, including correctness testing, portability, and understanding the energy requirements of verification. HPC and DL have distinct standards and philosophies for software development, encompassing established practices for correctness testing, build systems, optimization strategies, and framework and programming language choices. For the coupled HPC simulation and DL ecosystem, cohesive strategies are still in development, which necessitates collaboration between HPC and DL practitioners.
Addressing these concerns is crucial to ensure the scientific readiness of integrated programming environments. A fundamental question arises: how can communities collaborate to create unified, converged HPC and DL programming environments that are reliable, thoroughly tested, and energy-efficient? In particular, what metrics and frameworks should we use for reproducibility and correctness assessment, and how do we sustainably consolidate the energy used for verification? Additionally, how can we build on existing strengths and practices in the HPC and DL communities while fostering productive collaboration between industry and academia?
This BoF brings together representatives from government research facilities, such as Argonne National Laboratory (ALCF), Oak Ridge National Laboratory (OLCF), the Swiss National Supercomputing Centre, and the Jülich Supercomputing Centre. From industry, AMD and Groq will be present. The participants will showcase ongoing work at their respective centers, providing context for further discussion and collaboration.
This BoF aims to cultivate a growing community interested in the convergence of HPC simulation and the AI/DL stack. By the end of this BoF, several outcomes are anticipated:
1. An understanding of the current capabilities researchers can leverage, allowing newcomers to the field to witness the possibilities offered by the current state of converged DL and HPC technology.
2. A summary of the challenges and pain points of the current ecosystem, alongside an enumeration of desired outcomes that participants believe would significantly advance the current state-of-the-art research.
3. Identification of the gap between current capabilities and desired outcomes as a baseline for gauging progress over the next several years.
4. Action steps to facilitate progress towards said desired outcomes, with a focus on international collaboration and joint research.
Format
On Site
Targeted Audience
This BoF is suitable for beginner and intermediate audiences interested in understanding and discussing the current state of verification and validation (V&V) and sustainability in converged high-performance computing (HPC) and artificial intelligence (AI). The target audience is students and researchers at universities and research institutes.
Speakers

Oscar Hernandez Mendoza
Senior Staff Member, Oak Ridge National Laboratory
Murali Emani
Computer Scientist, Argonne National Laboratory
Siddhisanket Raskar
Assistant Computer Scientist, Argonne National Laboratory
Sanjif Shanmugavelu
ML Research Engineer, Groq Inc; Maxeler Technologies, a Groq Company
Ada Sedova
Associate Research Scientist, Oak Ridge National Laboratory
Mathieu Taillefumier
Computer Scientist, Swiss National Supercomputing Centre, ETH Zurich
Stefan Kesselheim
Head of SDL Applied Machine Learning & AI Consultant team, Jülich Supercomputing Centre
J. Austin Ellis
APU Application Architect, AMD, Lawrence Livermore National Laboratory
