FP8 Makes Prefill Collective-Bound on H100: Replication Beats Deep TP for Short-Output Guardrail Serving

Wednesday, June 24, 2026 3:45 PM to 5:15 PM · 1 hr. 30 min. (Europe/Berlin)
Foyer D-G - 2nd Floor
Research Poster
HW and SW Design for Scalable Machine Learning · Large Language Models and Generative AI in HPC · Mixed Precision · ML Model Optimization · Optimizing for Energy and Performance

Information

Poster is on display and will be presented at the poster pitch session.
Large Language Models (LLMs) deployed as guardrails form a distinct, prefill-dominant serving regime: long input contexts with minimal output. In this setting, Time-to-First-Token (TTFT) is the critical latency metric. While FP8 inference on NVIDIA H100 GPUs significantly accelerates dense matrix multiplication (GEMM), it shifts the bottleneck toward inter-GPU synchronization.

This study evaluates the trade-off between scale-up (monolithic Tensor Parallelism, TP=8) and scale-out (replication, two disjoint TP=4 instances) for Llama-3.3-70B-Instruct on a single 8×H100 NVSwitch node. Using vLLM with FP8 quantization, we show that replication (TP=4×2) achieves higher throughput than the monolithic deployment in prefill-heavy workloads.
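The two deployments compared above could be launched roughly as follows. This is a minimal sketch, not the authors' setup: the poster does not list its launch flags, so the Hugging Face model ID, the port numbers, and the GPU-to-replica split are assumptions. `--tensor-parallel-size` and `--quantization fp8` are standard vLLM options.

```python
import os
import subprocess

# Assumed Hugging Face model ID; the poster only names the model.
MODEL = "meta-llama/Llama-3.3-70B-Instruct"


def launch_plan(mode):
    """Return a list of (command, env overrides) pairs for 'tp8' or 'tp4x2'."""
    if mode == "tp8":
        # One monolithic instance spanning all eight GPUs.
        groups = [("0,1,2,3,4,5,6,7", 8000)]
    elif mode == "tp4x2":
        # Two disjoint replicas, four GPUs each (ports are hypothetical).
        groups = [("0,1,2,3", 8000), ("4,5,6,7", 8001)]
    else:
        raise ValueError(f"unknown mode: {mode}")

    plan = []
    for gpus, port in groups:
        tp_degree = len(gpus.split(","))
        cmd = [
            "vllm", "serve", MODEL,
            "--tensor-parallel-size", str(tp_degree),
            "--quantization", "fp8",
            "--port", str(port),
        ]
        plan.append((cmd, {"CUDA_VISIBLE_DEVICES": gpus}))
    return plan


def launch(mode):
    """Start the servers; requires vLLM and an 8-GPU node."""
    return [
        subprocess.Popen(cmd, env={**os.environ, **env})
        for cmd, env in launch_plan(mode)
    ]
```

In the replicated case, requests still have to be spread across the two ports, for example by a round-robin load balancer, which this sketch leaves out.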

At an input length of 2048 tokens, replication improves throughput by 27.3% over TP=8. The gap is wider than under BF16 (18.1%) because FP8 shortens compute and thereby exposes synchronization overhead. Profiling shows that TP=8 becomes collective-bound, spending 35.1% of prefill time in NCCL All-Reduce, while TP=4 reduces this share to 25.5%.
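A share such as "percent of prefill time in NCCL All-Reduce" can be derived from a kernel-level profile. The sketch below is one possible way to compute it, not the authors' profiling method: it assumes a list of (kernel name, duration) pairs exported from a profiler, assumes kernels do not overlap in time, and uses made-up kernel names and durations rather than measurements.

```python
def collective_fraction(kernels):
    """Fraction of total kernel time spent in NCCL All-Reduce kernels.

    kernels: iterable of (kernel_name, duration_us) pairs.
    Assumes kernels run back to back, so durations can simply be summed.
    """
    total = 0.0
    collective = 0.0
    for name, duration_us in kernels:
        total += duration_us
        lowered = name.lower()
        if "nccl" in lowered and "allreduce" in lowered:
            collective += duration_us
    if total == 0:
        raise ValueError("trace contains no kernel time")
    return collective / total


# Toy trace with hypothetical names and durations (not data from the poster).
toy_trace = [
    ("gemm_fp8_kernel", 600.0),
    ("flash_attention_fwd", 150.0),
    ("ncclDevKernel_AllReduce_Sum_bf16_RING_LL", 250.0),
]
print(f"{collective_fraction(toy_trace):.1%}")  # prints 25.0%
```

If communication overlaps with compute in the real trace, summing durations overstates the exposed cost, and the intervals would need to be merged on a timeline instead.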

Operationally, while monolithic TP=8 offers lower mean TTFT under light load, replication provides superior robustness at saturation, sustaining higher request rates under a fixed P99 latency budget. We conclude that, for FP8-enabled guardrails on H100 nodes, scaling out with moderate TP degrees is more effective than maximizing TP depth for prefill-dominant workloads.
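The "request rate sustained under a fixed P99 latency budget" criterion can be made concrete as follows. This is a hedged sketch under stated assumptions: TTFT samples are already grouped by offered request rate, P99 is taken by the nearest-rank method, and the function names and data layout are hypothetical rather than taken from the poster.

```python
import math


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def max_rate_within_budget(ttft_by_rate, budget_s):
    """Highest offered rate whose P99 TTFT stays within budget_s.

    ttft_by_rate: dict mapping request rate (req/s) to a list of TTFT
    samples in seconds. Rates are scanned in increasing order and the scan
    stops at the first violation, treating it as the saturation point.
    Returns None if even the lowest rate exceeds the budget.
    """
    best = None
    for rate in sorted(ttft_by_rate):
        if p99(ttft_by_rate[rate]) <= budget_s:
            best = rate
        else:
            break
    return best
```

Running this once per deployment (TP=8 and TP=4×2) with the same budget gives the two sustained rates that the comparison above refers to.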
Format
on-demand · on-site