

Rail Optimized PCIe Topologies for LLMs
Thursday, June 12, 2025 9:00 AM to 9:25 AM · 25 min. (Europe/Berlin)
Hall F - 2nd floor
Research Paper
Composable Disaggregated Infrastructure · Emerging Computing Technologies · Interconnects and Networks
Information
Deep learning (DL)/artificial intelligence (AI) workloads are known to be computationally demanding, and recent advances with technologies like large language models (LLMs) require scaling to large pools of GPUs to hit performance targets.
Often, large HPC/AI workloads will rely on Ethernet or InfiniBand (IB) networking to scale across multi-node clusters once the resources within a node are insufficient.
The challenge is that using multi-node resources is far from seamless.
Workloads must adopt technologies such as MPI to communicate across nodes, and managing shared file systems and software consistently across a cluster of nodes introduces further difficulties.
Advances in composable disaggregated infrastructure (CDI) provide an alternative: they let system designers extend the system bus to entire racks of resources, delivering multi-node scale within the convenience of a single node.
However, there are technical differences between scaling system resources and scaling network resources, and the constraints imposed by technologies like PCIe provide an interesting set of trade-offs that need to be considered when designing systems.
In this work, we present a system bus optimized for AI workloads.
Our proposed design leverages a rail-optimized topology to achieve strong performance for LLM applications.
We demonstrate how the same topology and software routing optimizations deployed on Ethernet/IB networks can be applied to large-scale system bus deployments.
Our proposed design can achieve 3.7x better collective performance compared to a more traditional system bus design, and application-level evaluations demonstrate how intelligent topology design can provide a speedup of up to 1.8x in LLM inference and a performance improvement of up to 12% in LLM training throughput.
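As a rough illustration of the rail-optimized idea described above (not code from the paper), such a layout attaches the GPU with local index k on every node or chassis to switch ("rail") k, so the same-index peers that carry most of the collective traffic share a single switch hop. The sketch below assumes this one-rail-per-local-GPU-index convention; the rail_of and peers_on_rail helpers are hypothetical names used only for illustration.

    # Illustrative sketch of rail-optimized placement (assumed convention:
    # one rail per local GPU index; not the authors' implementation).

    def rail_of(node: int, local_gpu: int) -> int:
        """Rail (switch) index for a GPU in a rail-optimized layout."""
        # The GPU's local index determines its rail, regardless of node.
        return local_gpu

    def peers_on_rail(rail: int, num_nodes: int) -> list[tuple[int, int]]:
        """All (node, local_gpu) endpoints that land on the given rail."""
        return [(node, rail) for node in range(num_nodes)]

    if __name__ == "__main__":
        # 4 nodes x 8 GPUs: GPU 3 of every node shares rail 3, so the
        # intra-rail phase of a collective never crosses switches.
        print(peers_on_rail(rail=3, num_nodes=4))

Under this convention, collectives can be scheduled so that the bulk of their traffic stays within a rail, which is the same property rail-optimized Ethernet/IB fabrics exploit.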
Format
On Demand · On Site
Documents & Links
Read the Full Paper Open Access at IEEE Xplore!



