Principles and Practice of Scalable and Distributed Deep Neural Networks Training and Inference

Friday, June 13, 2025 9:00 AM to 1:00 PM · 4 hr. (Europe/Berlin)

Hall Y4 - 2nd floor

Tutorial

AI Applications powered by HPC Technologies · HW and SW Design for Scalable Machine Learning · Large Language Models and Generative AI in HPC · ML Systems and Tools

Information

Recent advances in Deep Learning (DL) have led to many exciting challenges and opportunities. Modern DL frameworks including TensorFlow, PyTorch, Horovod, and DeepSpeed enable high-performance training, inference, and deployment for various types of Deep Neural Networks (DNNs) such as GPT, BERT, ViT, and ResNet. This tutorial provides an overview of recent trends in DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We also present an overview of different DNN architectures, DL frameworks, and DL training and inference, with a special focus on parallelization strategies for model training. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures and efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU and GPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises so that attendees gain first-hand experience of running distributed DL training and inference on a modern GPU cluster.

Format

On Site

Targeted Audience

This tutorial targets professionals and newcomers in Machine Learning, Deep Learning, and MPI-based distributed training on HPC clusters. It’s designed for scientists, engineers, researchers, data scientists, developers, students, managers, and administrators working with modern high-performance interconnects.

Beginner Level

50%

Intermediate Level

50%