Strategies for Managing and Querying Distributed Simulation Data for AI

Monday, June 22, 2026 9:00 AM to 1:00 PM · 4 hr. (Europe/Berlin)
Hall X6 - 1st Floor
Tutorial
AI Applications powered by HPC Technologies · Application Workflows for Discovery · Digital Twins and ML · High-Performance Data Analytics · HPC Simulations enhanced by Machine Learning

Information

One of the critical bottlenecks in the recent HPC-AI convergence is managing the sheer volume and complexity of scientific HPC data that needs to be ingested and used by AI codes. Developing robust AI models requires training on vast ensembles of simulation and experimental data that are often dispersed across multiple files, multiple storage tiers, and geographically distributed facilities. The challenge is no longer just generating data, but effectively indexing, searching, and streaming specific subsets of that data into AI training pipelines without overwhelming storage bandwidth. This full-day tutorial addresses these challenges by introducing a comprehensive framework for organizing, labeling, and accessing scientific datasets. We present strategies to transform raw, distributed files into curated "Campaign Archives"—logical collections of data that provide centralized metadata, statistics, and searchability regardless of the physical location of the files.
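
To make the "Campaign Archive" idea concrete, here is a minimal sketch of what such an archive could look like: a single metadata index recording variables, precomputed statistics, and the physical locations of files spread across systems, so that queries touch only metadata. All names, paths, and fields below are illustrative assumptions, not SCN's actual archive format.

```python
import json
from pathlib import Path

# Hypothetical, minimal campaign-archive index. Field names and URIs are
# illustrative assumptions only; SCN's real format is not reproduced here.
campaign = {
    "name": "fusion_ensemble_2026",
    "datasets": [
        {"variable": "electron_density",
         "uri": "remote-hpc:/archive/run_0421/xgc.bp",   # file on another facility
         "steps": 500,
         "stats": {"min": 1.2e18, "max": 9.7e19}},       # searchable statistics
        {"variable": "temperature",
         "uri": "local:/scratch/run_0422/diag.h5",
         "steps": 250,
         "stats": {"min": 305.0, "max": 2.4e4}},
    ],
}

# Querying the index is a pure metadata operation; no remote file is opened.
hot = [d for d in campaign["datasets"]
       if d["variable"] == "temperature" and d["stats"]["max"] > 1.0e4]

Path("campaign.json").write_text(json.dumps(campaign, indent=2))
```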

Central to this tutorial is the SciCampaign Navigator (SCN), a software framework designed to manage these archives. SCN enables a workflow, or a collaboration of scientists, to interact with dispersed datasets as if they were stored in a single, unified database. We will demonstrate how integrating these archives into AI workflows allows scientists to prepare large-scale simulation data for training by facilitating effective query, filtering, and remote access tuned specifically for machine learning tasks. Participants will learn how to bypass the limitations of traditional file-based I/O through advanced methods for selective data access. Because AI training rarely requires every byte of a dataset, we focus on retrieving only the most relevant subsets of data at user-defined accuracy levels, as sketched below.
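
The query-filter-stream pattern described above might look like the following in an AI data-loading script. Every SCN call here (scn.open_campaign, query, stream_subset) is a hypothetical placeholder rather than SCN's documented API; only the overall pattern, querying metadata first and then streaming a subset at a user-chosen accuracy, comes from the abstract.

```python
import scn  # hypothetical module name, used only for illustration

def training_subsets(campaign_uri: str, error_bound: float):
    """Yield only the relevant portions of a campaign, at reduced accuracy."""
    archive = scn.open_campaign(campaign_uri)          # metadata-only handle
    matches = archive.query(variable="temperature",    # filter before any bulk I/O
                            step_range=(100, 200))
    for record in matches:
        # Stream just the selected region; 'error_bound' stands in for the
        # user-defined accuracy level mentioned above (e.g., lossy compression).
        yield record.stream_subset(region=((0, 0), (128, 128)),
                                   error_bound=error_bound)
```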

The tutorial will introduce participants to several US Department of Energy Exascale Computing Project tools and technologies, including 1) ADIOS and HDF5, which provide a publish/subscribe I/O abstraction and a unifying storage interface capable of staging I/O, integrating compression, and handling quantities of interest at the I/O library level, and 2) SCN, which facilitates on-demand, selective, and remote access to HPC datasets and multi-resolution images. These tools are integrated to allow both file-based and in-memory streaming, so that large datasets can be found and managed efficiently for training and model evaluation. The tutorial will showcase applications from diverse scientific domains (e.g., fusion, combustion, and experimental data) to demonstrate how this organizational framework transforms datasets into assets for domain-aware AI model training.
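
As one concrete, runnable illustration of selective access with a library named above, the h5py snippet below reads a single hyperslab of one variable instead of ingesting the entire file; the file name, dataset path, and shape are assumptions for the example. ADIOS2's Python bindings support an analogous start/count selection on BP files.

```python
import h5py

# Selective read: one timestep of one variable, subsampled 4x in each spatial
# dimension. Only the requested hyperslab is transferred from storage.
# File and dataset names are illustrative assumptions.
with h5py.File("simulation.h5", "r") as f:
    dset = f["/fields/temperature"]      # e.g., shape (steps, ny, nx)
    subset = dset[10, ::4, ::4]          # h5py issues a hyperslab selection
    print(subset.shape, subset.dtype)
```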

We invite AI researchers, domain scientists working with simulation or experimental datasets, and HPC scientists and developers aiming to streamline AI data workflows to learn essential tools and gain insights into managing their large-scale data on resources ranging from laptops to exascale computers. We also invite researchers and developers working with large-scale simulations and experiments to explore this tutorial, contribute to ongoing efforts, and help build comprehensive solutions for integrating AI training with large-scale scientific datasets.
Format
on-site
Targeted Audience
This tutorial is for application scientists and developers who generate or use large scientific datasets on HPC systems to train AI models or run inference with them. It provides practical techniques and tools to efficiently manage, query, reduce, and integrate data, optimizing AI workflows for both local teams and distributed collaborations.
Beginner Level
40%
Intermediate Level
60%
Prerequisites
The lectures are targeted at beginner and intermediate users across multiple topics. The hands-on exercises are targeted at intermediate to advanced users with parallel programming experience; they use a simulation code in C++ as well as training and plotting scripts in Python, which participants will run in a Linux environment in the cloud. Requirements for attendees: bring a laptop or tablet with a keyboard. Attendees will log into the E4S AWS cloud through a browser and use Linux terminals in the cloud virtual machine.