RESEARCH POSTER AWARD: 1st Place: DisCostiC: Digital Twin Performance Simulations Unlocking Hardware-Software Interplay

Tuesday, June 10, 2025 3:00 PM to Thursday, June 12, 2025 4:00 PM · 2 days 1 hr. (Europe/Berlin)
Foyer D-G - 2nd floor
Research Poster
Performance Tools and Simulators

Information

Poster is on display and will be presented at the poster pitch session.
We present an evaluation of MPI-parallel distributed-memory applications using DisCostiC, a cross-architecture, full-scale parallel simulation framework designed to study performance behavior in controlled environments. Real-world executions of parallel applications often experience unpredictable behavior due to system noise, application imbalance, and the breakdown of synchronized execution patterns. This desynchronization can lead to overlapping communication and computation, making it difficult to accurately predict performance using traditional metrics. Studying these effects on real systems is limited by uncontrollable variables and the inability to isolate hardware-software interactions cleanly.

DisCostiC addresses these limitations through a model-based approach that does not run application code on real hardware. Instead, it uses application skeletons written in a Domain-Specific Embedded Language (DSEL), which explicitly capture inter-process dependencies and control flow. The framework integrates multiple components: a comprehensive machine model spanning cores, memory hierarchies, nodes, and networks; analytic performance models such as Roofline, Execution-Cache-Memory (ECM), Hockney, and LogGP; and a model of MPI behavior (e.g., eager vs. rendezvous protocols). Together, these enable the generation of simulated traces that reflect the expected execution timeline, which can be visualized using tools like the Chromium browser's trace viewer, ITAC, or Vampir.
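As a rough illustration of the kind of analytic models listed above, the following sketch writes out the textbook forms of Roofline, Hockney, and LogGP in Python. It is not DisCostiC's API: the function names and all numeric values are hypothetical and chosen only to show how such models turn machine parameters into time predictions.

```python
# Illustrative sketch only: textbook forms of the analytic models named above.
# Function names and all numeric values are hypothetical, not DisCostiC's API.

def roofline_time(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Roofline: runtime is bounded by the slower of compute and memory traffic."""
    return max(flops / peak_flops, bytes_moved / mem_bandwidth)


def hockney_time(message_bytes, latency, bandwidth):
    """Hockney: point-to-point time = latency + message size / bandwidth."""
    return latency + message_bytes / bandwidth


def loggp_time(message_bytes, L, o, G):
    """LogGP: a single k-byte message costs L + 2*o + (k - 1)*G."""
    return L + 2 * o + (message_bytes - 1) * G


if __name__ == "__main__":
    # Hypothetical memory-bound sweep: 1e9 flops, 8e9 bytes of traffic,
    # 1 Tflop/s peak, 100 GB/s memory bandwidth.
    t_comp = roofline_time(1e9, 8e9, 1e12, 1e11)
    # Hypothetical 1 MB halo exchange: 1 us latency, 10 GB/s bandwidth.
    t_comm = hockney_time(1e6, 1e-6, 1e10)
    print(f"compute phase: {t_comp:.4f} s, halo exchange: {t_comm:.6f} s")
```

In this toy setting the memory term dominates the compute phase, which is the regime the memory-bound Jacobi case studies operate in.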

We demonstrate DisCostiC's capabilities through two case studies. First, we simulate a 2D memory-bound Jacobi stencil on the heterogeneous Wisteria/BDEC-01 system, with nodes from Odyssey (A64FX) and Aquarius (Ice Lake). Using the WaitIO-MPI wrapper in socket mode over InfiniBand, we observe how inter-cluster latency and desynchronization affect performance. Surprisingly, despite balanced workloads, desynchronization caused by slower communication and idle waves leads to improved bandwidth per process due to reduced contention and increased overlap between computation and communication.

The second case study compares simulated and actual performance for a Jacobi solver on 10 Ice Lake nodes with varying task counts per NUMA domain. DisCostiC accurately models memory contention and reveals the non-scaling behavior within a single ccNUMA domain.

To validate simulation accuracy, we evaluate DisCostiC using proxy kernels from real-world applications, including Chebyshev filter diagonalization, Gauss-Seidel Successive Over-Relaxation (GSSOR), High-Performance Conjugate Gradients (HPCG), and Optical Flow solvers. These benchmarks were run across Intel (Ice Lake, Sapphire Rapids) and non-Intel (A64FX) systems. Across strong and weak scaling experiments, the simulation error consistently remained below 2%, confirming the reliability of the framework.
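The abstract does not state how the simulation error is computed. One common choice, assumed here purely for illustration, is the relative deviation of the simulated runtime from the measured runtime. The runtimes in the sketch below are made-up placeholders, not results from the poster.

```python
# Illustrative sketch: relative error between simulated and measured runtimes.
# The error definition is an assumption; the sample runtimes are hypothetical.

def relative_error(simulated, measured):
    """Return |simulated - measured| / measured as a fraction."""
    if measured <= 0:
        raise ValueError("measured runtime must be positive")
    return abs(simulated - measured) / measured


if __name__ == "__main__":
    # (simulated seconds, measured seconds) for hypothetical runs.
    runs = [(9.85, 10.0), (20.3, 20.0), (4.96, 5.0)]
    worst = max(relative_error(s, m) for s, m in runs)
    print(f"worst-case relative error: {worst:.2%}")
    print("within 2% threshold:", worst < 0.02)
```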

DisCostiC is efficient for simulating long-running applications, since the simulation time is usually only a small fraction of the actual runtime. Although very short runs are more efficiently executed directly, the simulator scales effectively and has been tested on configurations involving over 4500 simulated MPI processes.

In summary, DisCostiC enables model-based design-space exploration, making it possible to study complex performance interactions in heterogeneous systems. It offers a reproducible, high-fidelity alternative to trace-based simulation, helping developers understand and optimize the execution characteristics of large-scale parallel applications under various system conditions.

In future work, we plan to extend DisCostiC with energy modeling, more fine-grained bottleneck analysis (e.g., cache), and support for accelerated workloads.
Format
On Demand, On Site