Rigorous Evaluation of LLM Components in HPC Research

Monday, June 22, 2026 2:00 PM to 6:00 PM · 4 hr. (Europe/Berlin)
Hall X12 - 1st Floor
Tutorial
Large Language Models and Generative AI in HPC

Information

Large language models (LLMs) are increasingly used as components inside HPC research workflows, in roles ranging from code generation and translation to agentic tool use for debugging, profiling, and experiment orchestration. While these systems can accelerate development, they also introduce new challenges for rigorous evaluation: outputs are stochastic, behavior is sensitive to prompts and configuration, commercial models change silently over time, and the true contribution of the LLM is often confounded with many other experimental factors. As a result, a growing number of papers report LLM-based results without sufficient transparency, suitable baselines, or statistically sound measurements.

This tutorial will teach participants a practical framework for rigorous evaluation and reporting of LLM components in HPC research. We cover what must be reported for reproducibility, how to design experiments around stochastic components, and when and how to incorporate human validation. We then focus on HPC-specific evaluation and the unique challenges that arise in using LLMs in HPC research. Throughout, we highlight common pitfalls and provide guidance and templates that participants can directly apply to their own projects and papers.
Format
on-site
Targeted Audience
The tutorial is aimed at HPC developers, researchers, and students who use or benefit from AI in their workflows, from beginners to those already experienced in using LLMs in their work.
Beginner Level
90%
Intermediate Level
10%
Prerequisites
We will provide hands-on demos to help illustrate the concepts, but they are not essential to learning the material. Only a modern laptop with an internet connection is needed to follow along with the tutorial contents.