

RESEARCH POSTER AWARD: 2nd Place: ZeroSum: User Space Utility for Monitoring Hardware and Software Resources for HPC
Tuesday, June 10, 2025 3:00 PM to Thursday, June 12, 2025 4:00 PM · 2 days 1 hr. (Europe/Berlin)
Foyer D-G - 2nd floor
Research Poster
Compiler and Tools for Parallel Programming, Performance Measurement, Performance Tools and Simulators, Resource Management and Scheduling, System and Performance Monitoring
Information
Poster is on display and will be presented at the poster pitch session.
High Performance Computing (HPC) systems are large, heterogeneous, and sophisticated, and are therefore difficult to use efficiently. HPC users are allocated finite compute time on these systems, yet have no portable utility to confirm that they are using their allocation effectively. ZeroSum is a user-space library that is launched within the process space of the HPC application. For each application process, it monitors the application threads, MPI communication, and the hardware resources assigned to them, including CPU cores and/or hardware threads, memory usage, and GPU utilization. Supported systems include Linux-based operating systems, as well as GPUs from NVIDIA (using the NVML library), AMD (using the ROCm-SMI library), and Intel (using the SYCL API). Host-side monitoring uses the virtual /proc filesystem and is therefore portable to all Linux systems. When ZeroSum is integrated with the hwloc library, the included Python post-processing scripts can generate visualizations of the utilization data. Automatic deadlock detection is available: ZeroSum generates call stacks from all ranks, merges them, and visualizes the merged call stacks to help diagnose where expected behavior diverged (similar to STAT/Cray-STAT). Monitoring overhead is less than 0.5%. As future work, we plan enhancements including monitoring of network, filesystem, and other devices, automated statistical analysis of monitoring data, and in situ monitoring support through asynchronous aggregation services.
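To illustrate the host-side approach described above, here is a minimal Python sketch of reading per-process and per-thread data from the /proc filesystem. It is not ZeroSum's implementation. The function names (parse_status, expand_cpu_list, parse_thread_stat, snapshot) are hypothetical and chosen for illustration. The sketch relies only on the standard Linux files /proc/<pid>/status and /proc/<pid>/task/<tid>/stat.

```python
from pathlib import Path


def parse_status(text):
    """Parse the 'Key: value' lines of a /proc/<pid>/status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def expand_cpu_list(cpu_list):
    """Expand a kernel CPU list such as '0-3,8,10-11' into a list of ints."""
    cpus = []
    for part in cpu_list.split(","):
        part = part.strip()
        if not part:
            continue
        lo, sep, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi if sep else lo) + 1))
    return cpus


def parse_thread_stat(text):
    """Extract state, CPU ticks and last-used CPU from a task's stat file."""
    # The command name is parenthesised and may itself contain spaces or
    # parentheses, so split only the text after the last ')'.
    rest = text[text.rindex(")") + 1:].split()
    # rest[0] is field 3 (state); utime, stime and processor are
    # fields 14, 15 and 39 of the stat line, hence indices 11, 12 and 36.
    return {
        "state": rest[0],
        "utime_ticks": int(rest[11]),
        "stime_ticks": int(rest[12]),
        "last_cpu": int(rest[36]),
    }


def snapshot(pid="self"):
    """Take one sample of a process: allowed CPUs, resident memory, threads."""
    proc = Path("/proc") / str(pid)
    status = parse_status((proc / "status").read_text())
    threads = {}
    for task in (proc / "task").iterdir():
        try:
            threads[int(task.name)] = parse_thread_stat((task / "stat").read_text())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listing and reading
    return {
        "allowed_cpus": expand_cpu_list(status.get("Cpus_allowed_list", "")),
        "rss_kib": int(status.get("VmRSS", "0 kB").split()[0]),
        "threads": threads,
    }
```

On a Linux system, calling snapshot() twice and taking the difference of each thread's utime_ticks and stime_ticks between the two samples gives a rough per-thread CPU utilization over that interval. Comparing last_cpu against allowed_cpus shows where each thread is running relative to its assigned cores.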
Contributors:
Format
On Demand, On Site


