

Scalable Tensor Network Contraction for Quantum Circuit Simulation: An MPI-Based Approach for Efficient Multi-CPU Deployment
Wednesday, June 24, 2026 3:45 PM to 5:15 PM · 1 hr. 30 min. (Europe/Berlin)
Foyer D-G - 2nd Floor
Research Poster
Integration of Quantum Computing and HPC · Parallel Numerical Algorithms · Performance Tools and Simulators · Simulating Quantum Systems
Information
Poster is on display and will be presented at the poster pitch session.
Quantum circuit simulation is essential for algorithm development and verification, yet it is increasingly bottlenecked on supercomputers by data movement rather than arithmetic. Direct tensor contractions incur irregular memory access, poor cache reuse, and costly communication when the exponential-size state is distributed, so CPU-only scaling often falls short without accelerators or specialised kernels.
We present an MPI-based quantum circuit simulator that makes distributed CPU simulation practical by reformulating k-qubit gate contractions as dense matrix multiplications and by treating data layout and communication as first-class design concerns. For each gate, we permute the global qubit index ordering so the target qubits are grouped contiguously, reshape the n-qubit state into a 2^k × 2^(n−k) matrix, and apply the gate via vendor-optimised BLAS GEMM. This increases arithmetic intensity and improves locality, mapping the dominant compute to standard HPC kernels.
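The permute–reshape–GEMM step described above can be sketched in a few lines of NumPy. This is a minimal single-node illustration, not the authors' code: the function name, the qubit-ordering convention (qubit 0 as the most significant index bit), and the use of NumPy's `@` operator as a stand-in for a vendor BLAS GEMM are all assumptions for clarity.

```python
import numpy as np

def apply_gate(state, gate, targets, n):
    """Apply a k-qubit gate to an n-qubit state vector by:
    1. permuting the qubit axes so the target qubits come first,
    2. reshaping the state into a 2^k x 2^(n-k) matrix,
    3. applying the gate as a single dense matrix multiply (GEMM)."""
    k = len(targets)
    rest = [q for q in range(n) if q not in targets]
    perm = list(targets) + rest
    # View the length-2^n vector as an n-way tensor of shape (2, ..., 2)
    # and bring the target-qubit axes to the front.
    psi = state.reshape([2] * n).transpose(perm)
    # One GEMM applies the gate to all 2^(n-k) "column" amplitudes at once.
    out = gate @ psi.reshape(2**k, 2**(n - k))
    # Undo the permutation to restore the original qubit ordering.
    inv = np.argsort(perm)
    return out.reshape([2] * n).transpose(inv).reshape(-1)
```

Because the reshaped operand is a dense contiguous matrix, the dominant cost becomes one high-arithmetic-intensity GEMM per gate rather than a strided sparse update of the state vector.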
Our key contribution is a communication-aware distributed permutation (transpose) mechanism that efficiently realigns the distributed state between gate layers. With an MPI decomposition in which each rank holds a local sub-state, we exchange only the necessary contiguous amplitude blocks to realise a new qubit ordering, cleanly separating (i) permutation planning, (ii) structured MPI data exchange, and (iii) local GEMM-based gate application. This converts an irregular contraction workflow into a predictable HPC pattern: bulk communication followed by compute-efficient dense linear algebra, with clear opportunities for overlap and performance modelling.
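To make the "exchange only the necessary contiguous amplitude blocks" idea concrete, the sketch below works through one common special case: swapping a "global" qubit encoded in the MPI rank id with the most significant local qubit. This is an illustrative reconstruction under assumed conventions (rank bits as the most significant global index bits), not the authors' implementation; the MPI pairwise exchange is emulated in-memory so the plan can be checked, where a real run would use a point-to-point `Sendrecv` per rank pair.

```python
import numpy as np

def swap_plan(rank, g, local_n):
    """Permutation planning for swapping global qubit g (bit g of the
    rank id) with the top local qubit. Returns the partner rank and the
    single contiguous half-block this rank must send. Hypothetical sketch."""
    partner = rank ^ (1 << g)            # ranks differing in bit g pair up
    half = 1 << (local_n - 1)            # amplitudes in one half-block
    my_bit = (rank >> g) & 1
    # The half whose top local bit differs from my rank's bit g migrates;
    # the other half stays in place. Lower index bits are untouched.
    send_offset = half if my_bit == 0 else 0
    return partner, send_offset, half

def distributed_swap(blocks, g, local_n):
    """In-memory stand-in for the structured MPI exchange: blocks[r] is
    rank r's local sub-state. Each rank receives its partner's
    complementary half into the slot it sent from."""
    new = [b.copy() for b in blocks]
    for r in range(len(blocks)):
        partner, off, half = swap_plan(r, g, local_n)
        p_off = half - off               # partner sends the other half
        new[r][off:off + half] = blocks[partner][p_off:p_off + half]
    return new
```

The planning step is pure index arithmetic, so it separates cleanly from the data exchange, and because each rank trades exactly one contiguous half-block with one partner, the communication volume and pattern are fully predictable before any data moves.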
We evaluate on the Setonix supercomputer (Pawsey Supercomputing Research Centre) and observe that benefits grow with problem size and concurrency: while small circuits see limited gains, larger simulations increasingly favour the matrix-centric approach. We have successfully simulated up to 38-qubit circuits on 128 CPU nodes, demonstrating scalable CPU-only execution in a production MPI+OpenMP environment. Ongoing work extends to deeper circuits and noise models, explores GPU-aware MPI for heterogeneous systems, and integrates with emerging quantum workflow stacks (e.g., CUDA-Q, Amazon Braket, AutoQASM) to support hybrid pipelines and reproducible benchmarking.
Contributors:
Format
on-demand · on-site
