

Exposing Parallelism in Selected Inversion of Structured Sparse Systems
Wednesday, June 24, 2026 3:45 PM to 5:15 PM · 1 hr. 30 min. (Europe/Berlin)
Foyer D-G - 2nd Floor
Research Poster
Chemistry and Materials Science · Earth, Climate and Weather Modeling · Novel Algorithms · Parallel Numerical Algorithms
Information
Poster is on display and will be presented at the poster pitch session.
Many scientific applications require the solution of large sparse linear systems of the form $Ax=b$. These systems are commonly addressed with sparse direct methods: a matrix factorization followed by triangular solves. In typical settings, the number of right-hand sides (RHS) is small compared to the matrix dimension, so computational cost and memory usage are dominated by the factorization step. However, in application domains such as quantum transport and spatio-temporal Bayesian modeling, the number of RHS may equal the order of the system matrix. In this regime, the computational cost shifts to the triangular solves, and memory requirements grow quadratically as the solution becomes dense, rendering classical approaches infeasible.
Both application domains share the property that only selected entries of the inverse matrix (or selected components of the solution) are required. This motivates \emph{selected linear algebra} algorithms, which compute the needed quantities without forming dense intermediate results. Moreover, the system matrices arising in these problems typically exhibit structured sparsity patterns: block-tridiagonal matrices, possibly augmented by arrowhead couplings.
By leveraging this structure, it is possible to design algorithms that further reduce both the memory footprint and the computational cost.
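For the block-tridiagonal case, the selected entries of the inverse satisfy a short two-sweep recurrence. The sketch below is an illustrative NumPy reference (the classic Schur-complement / recursive Green's function recurrence), not the `serinv` implementation; it computes only the diagonal blocks of $A^{-1}$ without ever forming the dense inverse:

```python
import numpy as np

def selected_inverse_diag(D, L, U):
    """Diagonal blocks of inv(A) for a block-tridiagonal A.

    D: list of n diagonal blocks A[i,i]
    L: list of n-1 sub-diagonal blocks  A[i+1,i]
    U: list of n-1 super-diagonal blocks A[i,i+1]

    Two sequential sweeps of length n; this block-level dependency
    chain is precisely what limits parallelism in the classical order.
    """
    n = len(D)
    # Forward sweep: Schur complements S[i] = D[i] - L[i-1] S[i-1]^{-1} U[i-1]
    S = [D[0]]
    for i in range(1, n):
        S.append(D[i] - L[i-1] @ np.linalg.inv(S[i-1]) @ U[i-1])
    # Backward sweep: diagonal blocks of the inverse
    G = [None] * n
    G[n-1] = np.linalg.inv(S[n-1])
    for i in range(n-2, -1, -1):
        Sinv = np.linalg.inv(S[i])
        G[i] = Sinv + Sinv @ U[i] @ G[i+1] @ L[i] @ Sinv
    return G
```

Memory stays at $O(n b^2)$ for block size $b$, versus $O(n^2 b^2)$ for the dense inverse.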
Despite these advantages, existing selected linear algebra methods remain limited in scalability. Even for structured sparsity patterns, classical elimination and traversal orders induce strong block-level sequential dependencies, leading to long critical paths and restricting parallelism. In addition, the lack of distributed-memory formulations prevents scaling beyond a single compute node.
In this work, we develop distributed-memory selected linear algebra algorithms for block-tridiagonal and block-tridiagonal-with-arrowhead sparse matrices, targeting both selected inversion of $AX=I$ and selected solutions of the quadratic matrix equation $AXA^\dagger = B$. Our methods are implemented in the GPU-accelerated `serinv` library and integrated into two large-scale scientific codes: DALIA for spatio-temporal Bayesian inference and QuaTrEx for quantum transport simulations.
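To make the second problem concrete: the full solution $X = A^{-1} B A^{-\dagger}$ of $AXA^\dagger = B$ is dense, but only its block-tridiagonal entries are needed. The brute-force check below (a dense reference for small cases, with an assumed block size `bs`; the structured algorithms avoid the dense intermediate entirely) shows which entries "selected" refers to:

```python
import numpy as np

def selected_quadratic_solution(A, B, bs):
    """Dense reference for X in A X A^H = B, keeping only the
    block-tridiagonal part of X. Illustrative only: the O(n^2)
    dense intermediate is exactly what selected algorithms avoid.
    """
    Ainv = np.linalg.inv(A)
    X = Ainv @ B @ Ainv.conj().T           # dense intermediate
    n = A.shape[0] // bs
    sel = np.zeros_like(X)
    for i in range(n):                     # keep diagonal +/- one block
        lo, hi = max(i - 1, 0) * bs, min(i + 2, n) * bs
        sel[i*bs:(i+1)*bs, lo:hi] = X[i*bs:(i+1)*bs, lo:hi]
    return sel
```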
The central methodological insight is that the primary obstacle to scalability lies in block-level dependency structures rather than arithmetic intensity. By explicitly analyzing the dependency graphs induced by block operations, we have developed permutation strategies that restructure elimination and computation orders while preserving sparsity and numerical correctness. These reorderings shift sequential dependencies to expose highly parallel sections in the algorithm, reducing algorithmic depth at the cost of increased total work. This work–depth trade-off enables efficient execution on modern distributed and accelerator-based architectures.
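One well-known instance of this trade-off is block cyclic reduction: eliminating all odd-indexed blocks of a block-tridiagonal matrix requires only each block's two neighbors, so all eliminations at one level proceed in parallel, and the Schur complement on the even-indexed blocks is again block-tridiagonal. The sketch below illustrates this reordering idea in NumPy (it is not the permutation strategy used in `serinv`, just a minimal example of trading extra work for logarithmic depth):

```python
import numpy as np

def cyclic_reduction_step(D, L, U):
    """One level of block cyclic reduction on a block-tridiagonal system.

    D: n diagonal blocks; L[i] = A[i+1,i]; U[i] = A[i,i+1].
    All odd-block inversions are mutually independent (parallelizable),
    and the reduced system on even blocks is again block-tridiagonal,
    so log2(n) levels replace the length-n sequential chain.
    Returns (D2, L2, U2) for the reduced system.
    """
    n = len(D)
    Dinv = {i: np.linalg.inv(D[i]) for i in range(1, n, 2)}  # independent
    evens = list(range(0, n, 2))
    D2, L2, U2 = [], [], []
    for i in evens:
        Di = D[i].copy()
        if i - 1 >= 0:
            Di -= L[i-1] @ Dinv[i-1] @ U[i-1]
        if i + 1 < n:
            Di -= U[i] @ Dinv[i+1] @ L[i]
        D2.append(Di)
        if i + 2 < n:  # fill-in coupling even i to even i+2 through odd i+1
            U2.append(-U[i] @ Dinv[i+1] @ U[i+1])
            L2.append(-L[i+1] @ Dinv[i+1] @ L[i])
    return D2, L2, U2
```

The extra multiplications per level are the "increased total work" paid for the shorter critical path.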
We demonstrate the impact of these developments in DALIA and QuaTrEx. In the former case, we fit a co-regional air pollution model over Northern Italy at a scale eight times larger than the previous state of the art, achieving a two-order-of-magnitude speedup. In the latter application, we enable self-consistent quantum transport simulations of devices with up to 84,480 atoms, including electron–electron interactions.
Beyond these results, we are currently developing visual representations as tools for reasoning about selected linear algebra algorithms. By making dependency structures, critical paths, and exposed parallelism explicit, these visual encodings support both the understanding of existing methods and the design of new algorithms as structural assumptions are progressively relaxed toward general sparse matrices.
Contributors:
Format
on-demand · on-site
