Distributed Deep Learning on GPU-based Clusters

Friday, June 13, 2025 2:00 PM to 6:00 PM · 4 hr. (Europe/Berlin)
Hall Y4 - 2nd floor
Tutorial
Large Language Models and Generative AI in HPC
ML Systems and Tools

Information

Deep learning (DL) is rapidly becoming pervasive in almost all areas of computer science, and is even being used to assist computational science simulations and data analysis. A key behavior of these deep neural networks (DNNs) is that they scale reliably, i.e., their performance continues to improve as the number of model parameters and the amount of data grow. As the demand for larger, more sophisticated, and more accurate DL models increases, the need for large-scale parallel model training, fine-tuning, and inference has become increasingly pressing. Consequently, several parallel algorithms and frameworks have been developed in the past few years to parallelize model training on GPU-based platforms. This tutorial will introduce the fundamentals of the state of the art in distributed deep learning. We will use large language models (LLMs) as a running example and teach the audience the basics of the three essential steps of working with LLMs: (i) training an LLM from scratch, (ii) continued training/fine-tuning of an LLM from a checkpoint, and (iii) inference on a trained LLM. We will cover algorithms and frameworks falling under the purview of data parallelism (PyTorch DDP and DeepSpeed) and tensor parallelism (AxoNN).
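As a taste of the data-parallel approach the tutorial covers, the sketch below shows the typical shape of a PyTorch DDP training loop. It is a minimal illustration, not tutorial material: the `Linear` model and random tensors are placeholders, and the environment-variable defaults simply let the script run as a single process when no launcher (e.g. `torchrun`) is used.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun normally sets these; defaults allow a single-process CPU run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # Use the "nccl" backend on GPU clusters; "gloo" works on CPU.
    dist.init_process_group(backend="gloo")

    model = DDP(torch.nn.Linear(8, 1))      # placeholder model
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(16, 8)                  # placeholder local batch shard
    y = torch.randn(16, 1)
    for _ in range(3):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()                      # DDP all-reduces gradients here
        opt.step()

    final_loss = loss.item()
    dist.destroy_process_group()
    return final_loss

# On a cluster, launch with e.g.: torchrun --nproc_per_node=4 train.py
```

Each rank runs this same script on its own shard of the data; DDP synchronizes gradients across ranks during `backward()`, so the optimizer step is identical everywhere.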
Format
On Site
Targeted Audience
This tutorial is designed for individuals who have experience in sequential/single-GPU model training with a framework like PyTorch or TensorFlow and want to start doing parallel training.
Beginner Level: 50%
Intermediate Level: 50%
