FP8 Makes Prefill Collective-Bound on H100: TP vs Replication for Short-Output LLM Guardrails

Wednesday, June 24, 2026 3:45 PM to 5:15 PM · 1 hr. 30 min. (Europe/Berlin)
Foyer D-G - 2nd Floor
Research Poster
HW and SW Design for Scalable Machine Learning · Large Language Models and Generative AI in HPC · Mixed Precision · ML Model Optimization · Optimizing for Energy and Performance

Information

Poster is on display and will be presented at the poster pitch session.
Large Language Models (LLMs) deployed as "guardrails" represent a distinct serving regime characterized by prefill-dominance: processing long contexts with minimal output generation. In this setting, Time-to-First-Token (TTFT) is critical. While FP8 inference on NVIDIA H100 GPUs significantly accelerates dense matrix computation (GEMM), it risks shifting the bottleneck to inter-GPU synchronization. This study evaluates the trade-off between scale-up (monolithic Tensor Parallelism, TP=8) and scale-out (replication, two disjoint TP=4 instances) for Llama-3.3-70B-Instruct on a single 8xH100 NVSwitch node.

Using vLLM with FP8 quantization, we demonstrate that replication (TP=4 × 2) consistently outperforms monolithic deployment for prefill-heavy workloads. At an input length of 2048 tokens, replication improves throughput by 27.3% over TP=8, a gap that widens compared to BF16 (18.1%) as FP8 exposes synchronization overheads. Detailed profiling reveals that TP=8 enters a collective-bound regime, spending 35.1% of prefill time in NCCL All-Reduce, whereas TP=4 reduces this to 25.5%. Operationally, while monolithic TP=8 offers lower mean TTFT under light load, replication provides superior robustness at saturation, maintaining lower P99 tail latency and higher sustainable request rates. We conclude that for FP8-enabled guardrails on H100 nodes, scaling out with moderate TP degrees is more effective than maximizing TP depth.
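As an illustration of the two deployment shapes compared above, the following is a minimal launch sketch using the vLLM CLI. It is an assumption-laden sketch, not the authors' exact setup: flag names follow the public `vllm serve` interface, the FP8 quantization flag and model identifier may vary across vLLM versions, and the load balancer that splits traffic between the two replicas is not shown.

```shell
# Scale-up: one monolithic vLLM instance spanning all 8 GPUs (TP=8).
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --port 8000

# Scale-out: two disjoint TP=4 replicas on the same 8xH100 node.
# Each replica is pinned to its own 4 GPUs via CUDA_VISIBLE_DEVICES;
# a front-end load balancer (not shown) would split guardrail
# requests between ports 8000 and 8001.
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 --quantization fp8 --port 8000 &
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 --quantization fp8 --port 8001 &
```

The replicated layout halves the number of ranks participating in each NCCL All-Reduce, which is the mechanism behind the reduced collective time reported in the abstract.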
Format
on-demand · on-site