Abstract
A systolic array provides an alternative computing paradigm to the von Neumann architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Freed from the constraint of a planar layout, we present a three-dimensional virtual systolic array architecture for this algorithm. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems.
Original language | English |
---|---|
Article number | 1442004 |
Journal | Parallel processing letters |
Volume | 24 |
Issue number | 4 |
DOIs | |
State | Published - Dec 22 2014 |
Funding
A systolic array provides an alternative computing paradigm to the von Neumann architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Freed from the constraint of a planar layout, we present a three-dimensional virtual systolic array architecture for this algorithm. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an ∗This work is supported in part by grant #SHF-1117062: “Parallel Unified Linear algebra with Systolic ARrays” (PULSAR) from the National Science Foundation (NSF). Work was funded in part by the Ministry of Education and Science of the Russian Federation, Agreement N 14.607.21.0006 (unique identifier RFMEFI57714X0020). The authors would like to thank the National Institute for Computational Sciences (NICS) for a generous time allocation on the Kraken supercomputer. This work is supported in part by grant #SHF-1117062: “Parallel Unified Linear algebra with Systolic ARrays” (PULSAR) from the National Science Foundation (NSF). Work was funded in part by the Ministry of Education and Science of the Russian Federation,
Funders | Funder number |
---|---|
National Science Foundation | |
Ministry of Education and Science of the Russian Federation | RFMEFI57714X0020 |
Keywords
- QR decomposition
- Systolic array
- scheduling runtime
- threading, message-passing, dataflow