SpMV for the Cerebras Wafer-Scale Engine

Wednesday, June 24, 2026 3:45 PM to 5:15 PM · 1 hr. 30 min. (Europe/Berlin)
Foyer D-G - 2nd Floor
Research Poster
Emerging Computing Technologies · Novel Algorithms · Parallel Numerical Algorithms · Post Moore Computing

Information

The poster is on display and will be presented at the poster pitch session.
Our poster presents a Sparse Matrix-Vector Product (SpMV) implementation for the AI-specialized Cerebras Wafer-Scale Engine (WSE). The algorithm supports arbitrary sparsity patterns without matrix reordering. We present the implementation, initial scaling results, and discuss improvements.

Sparse Basic Linear Algebra Subprograms (BLAS) routines are a cornerstone of high-performance scientific computing, providing highly optimized, standardized operations for manipulating sparse matrices and vectors. One particularly important routine is SpMV, the product of a sparse matrix and a dense vector. Sparse BLAS routines, and SpMV in particular, are essential not only in classical applications such as finite element analysis (FEA) and computational fluid dynamics (CFD), where large sparse systems of linear equations dominate the computational workload, but also in a growing number of specialized and emerging algorithms.

On the hardware side, today's and tomorrow's high-performance computing architectures are increasingly oriented towards artificial intelligence and machine learning workloads, including systems built around domain-specific accelerators such as those developed by Cerebras and other AI-focused vendors. While these clusters are often marketed for AI applications, computing centers extend their usage beyond deep learning: they also accelerate traditional simulation-driven workloads, hybrid workflows, and data-intensive tasks that benefit from the parallelism and memory hierarchies optimized for AI. One such device is the Cerebras Wafer-Scale Engine (WSE), a massively parallel accelerator fabricated from a single wafer that holds up to 900,000 cores. Each core has access only to a very fast but small local memory and communicates with its neighbors via a 2D mesh network-on-chip.
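For readers unfamiliar with the operation, SpMV over a matrix stored in the Compressed Sparse Row (CSR) format can be sketched as follows. This is a minimal NumPy illustration of the generic algorithm, not the poster's WSE implementation; all names are ours:

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR format.

    values  -- nonzero entries of A, row by row
    col_idx -- column index of each nonzero
    row_ptr -- row_ptr[i]:row_ptr[i+1] slices row i out of values/col_idx
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
values  = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(values, col_idx, row_ptr, x))  # [5. 2. 8.]
```

Each row's dot product touches only that row's nonzeros, which is what makes the per-row work of SpMV naturally parallel.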

We implement SpMV with direct input of CSR matrices, using a computation scheme that executes part of the computation on the host and offloads only an embarrassingly parallel part to the WSE. Furthermore, we present initial weak- and strong-scaling results, analyzing the contribution of individual components of our algorithm to the overall runtime. From these results, we identify bottlenecks and discuss initial optimization strategies. Our results show that this implementation is heavily bound by host-WSE communication, and we therefore suggest optimization strategies to reduce its impact. The current algorithm serves as a stepping stone toward supporting a full SpMV and, at a later stage, a full Krylov solver on the WSE, which we are actively researching.
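One plausible reading of such a host/device split, purely our illustration and not necessarily the authors' actual scheme, is a row-block partition: the host prepares the CSR data and the input vector, and each processing element (PE) computes the product for its own block of rows with no data from the other blocks, which is what makes that part embarrassingly parallel. The sketch below simulates the per-PE work sequentially; `n_pe` and the partitioning are hypothetical:

```python
import numpy as np

def partitioned_spmv(values, col_idx, row_ptr, x, n_pe):
    """Hypothetical row-block SpMV: split the CSR rows into n_pe blocks;
    each block's partial product is independent of the others, so each
    could in principle run on a separate PE (simulated here in a loop)."""
    n_rows = len(row_ptr) - 1
    bounds = np.linspace(0, n_rows, n_pe + 1, dtype=int)  # block edges
    y = np.empty(n_rows)
    for p in range(n_pe):  # stands in for n_pe independent PEs
        for i in range(bounds[p], bounds[p + 1]):
            s = 0.0
            for k in range(row_ptr[i], row_ptr[i + 1]):
                s += values[k] * x[col_idx[k]]
            y[i] = s
    return y

# Same 3x3 example matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
values  = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(partitioned_spmv(values, col_idx, row_ptr, x, n_pe=2))  # [5. 2. 8.]
```

In such a scheme the host must still ship the matrix blocks, the full input vector, and the result across the host-WSE link for every product, which is consistent with the communication-bound behavior the abstract reports.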
Format
On-demand · On-site