Performance Analysis of CPU Offloading for Imbalanced Workloads on Coupled Architectures

Wednesday, June 24, 2026 3:45 PM to 5:15 PM · 1 hr. 30 min. (Europe/Berlin)
Foyer D-G - 2nd Floor
Research Poster
AI Applications powered by HPC Technologies · Heterogeneous System Architectures · ML Systems and Frameworks · Performance Measurement

Information

The poster is on display and will be presented during the poster pitch session.
Modern large language models (LLMs), especially Mixture-of-Experts (MoE) architectures, demand enormous memory and compute resources. Although GPUs deliver high computational throughput, their limited memory capacity restricts model scale. Coupled CPU–GPU architectures with high-bandwidth interconnects and unified address spaces provide a promising way to extend effective memory capacity. However, CPU offloading introduces performance challenges due to bandwidth asymmetry: even advanced CPU–GPU links remain far slower than GPU HBM. Despite this, there is little empirical understanding of how to efficiently exploit such architectures, particularly when offloading model parameters or computation to CPU memory.

This work investigates CPU offloading strategies for imbalanced MoE workloads on coupled architectures, using the NVIDIA GH200 Grace Hopper Superchip as a case study. MoE routing is highly skewed: expert utilization varies by layer, and a large fraction of tokens are processed by a subset of experts, while others are rarely used. Our analysis shows that roughly 90% of tokens are routed to only 65% of experts, leaving the remaining experts lightly loaded. We hypothesize that these infrequently activated experts can be placed in CPU memory to reduce GPU memory pressure with little performance impact.
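The offloading order described above (move the least-used experts to CPU memory first) can be sketched as a simple selection routine. This is an illustrative sketch, not the authors' implementation; the function name, the probability values, and the token-share budget are all made up for demonstration.

```python
def experts_to_offload(routing_probs, token_share_budget):
    """Select experts in ascending order of routing probability until the
    cumulative share of tokens they handle would exceed the budget.
    Experts with low routing probability are offloaded first."""
    order = sorted(range(len(routing_probs)), key=lambda e: routing_probs[e])
    offloaded, share = [], 0.0
    for e in order:
        if share + routing_probs[e] > token_share_budget:
            break
        share += routing_probs[e]
        offloaded.append(e)
    return offloaded

# Hypothetical example: 8 experts with skewed routing; offload the experts
# that together receive at most 10% of tokens.
probs = [0.30, 0.22, 0.15, 0.12, 0.10, 0.06, 0.03, 0.02]
print(experts_to_offload(probs, 0.10))  # → [7, 6]
```

With this toy distribution, only the two rarest experts (5% of tokens combined) fit under a 10% budget, mirroring the paper's observation that a small minority of experts handle few enough tokens to be candidates for CPU placement.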

To evaluate this, we conduct a detailed performance study of Grouped GEMM, a core MoE kernel and a likely bottleneck under expert offloading, using PyTorch’s implementation. Latency is measured under three routing patterns: uniform, synthetic skew reflecting CPU–GPU bandwidth differences, and empirical routing derived from MMLU data. Experts are offloaded in ascending order of routing probability so that those processing fewer tokens are moved first.
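The uniform and synthetic-skew routing patterns can be mimicked with a small token generator like the one below. This is a hedged sketch under assumed parameters (the geometric skew factor is invented, and an empirical pattern would instead substitute probabilities measured from real routing traces such as the MMLU-derived ones mentioned above).

```python
import random

def route_tokens(n_tokens, n_experts, pattern="uniform", skew=1.5, seed=0):
    """Draw per-expert token counts under a synthetic routing pattern.
    'uniform' routes tokens evenly in expectation; 'skewed' uses a
    geometric decay so a few experts absorb most tokens. An empirical
    pattern would replace these weights with measured routing statistics."""
    rng = random.Random(seed)
    if pattern == "uniform":
        weights = [1.0] * n_experts
    elif pattern == "skewed":
        weights = [skew ** -e for e in range(n_experts)]  # geometric decay
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    counts = [0] * n_experts
    for e in rng.choices(range(n_experts), weights=weights, k=n_tokens):
        counts[e] += 1
    return counts
```

Per-expert counts from such a generator determine the group sizes fed to a Grouped GEMM, which is what makes the kernel's latency sensitive to routing skew.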

Results show that performance impact depends strongly on how many tokens are handled by offloaded experts. When token counts are small, runtime is dominated by CPU weight-loading overhead, leading to nearly constant or mildly increasing latency. When token counts grow, computation becomes dominant, and execution time rises steeply and approximately linearly. Larger batch sizes significantly reduce the relative slowdown from offloading. Under empirical routing with an activation rate of 0.9, up to 31.25% of experts can be offloaded with only a 1.5× slowdown at 65,536 tokens, and 50% can be offloaded with a 1.4× slowdown at 524,288 tokens. As workloads become more compute-bound, the cost of CPU weight transfers is amortized.

The key contribution is a systematic empirical characterization of expert CPU offloading on a modern coupled architecture using realistic routing distributions. We identify batch size and per-expert token counts as primary factors governing performance trade-offs, and show that latency impact can be predicted from routing statistics and workload scale. This enables more principled expert placement strategies.
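The claim that latency impact can be predicted from routing statistics and workload scale can be illustrated with a toy analytic cost model. All constants below (per-token compute cost, expert weight size, link bandwidth) are made-up placeholders, not measured GH200 numbers; the model only captures the qualitative trade-off described above.

```python
def predicted_latency(token_counts, offloaded, t_compute=1e-9,
                      expert_bytes=50e6, link_bw=100e9):
    """Toy latency model: every expert pays compute time proportional to its
    token count; an offloaded expert additionally pays a one-time weight
    transfer over the CPU-GPU link. All constants are illustrative."""
    latency = 0.0
    for e, n in enumerate(token_counts):
        latency += n * t_compute          # compute scales with tokens
        if e in offloaded and n > 0:
            latency += expert_bytes / link_bw  # weight load from CPU memory
    return latency

# Slowdown = latency with offloading / latency with everything on GPU.
# As token counts grow, the fixed transfer cost is amortized and the
# slowdown ratio shrinks, matching the compute-bound regime in the results.
offloaded = {6, 7}
small = [1_000] * 8
large = [1_000_000] * 8
s_small = predicted_latency(small, offloaded) / predicted_latency(small, set())
s_large = predicted_latency(large, offloaded) / predicted_latency(large, set())
```

In this toy model the transfer term is constant while the compute term grows linearly in tokens, which is exactly why larger batches reduce the relative slowdown from offloading.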

This study is exploratory and focuses on isolated Grouped GEMM kernels rather than full MoE pipelines. Future work includes end-to-end evaluations, incorporation of additional components such as activations and all-to-all communication, and development of runtime policies for efficient expert placement across CPU and GPU memory.
Format

on-demand · on-site