Recent advances in Deep Learning (DL) have created many exciting challenges and opportunities. Modern DL frameworks, including TensorFlow, PyTorch, Horovod, and DeepSpeed, enable high-performance training, inference, and deployment for various types of Deep Neural Networks (DNNs) such as GPT, BERT, ViT, and ResNet. This tutorial provides an overview of recent trends in DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We also present an overview of different DNN architectures, DL frameworks, and DL training and inference, with a special focus on parallelization strategies for model training. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU and GPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises that give attendees first-hand experience running distributed DL training and inference on a modern GPU cluster.
Target Audience
This tutorial targets professionals and newcomers in Machine Learning, Deep Learning, and MPI-based distributed training on HPC clusters. It’s designed for scientists, engineers, researchers, data scientists, developers, students, managers, and administrators working with modern high-performance interconnects.