Design and implementation of a large scale tree-based QR decomposition using a 3D virtual systolic array and a lightweight runtime

Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Jack Dongarra

Research output: Contribution to journalArticlepeer-review

1 Scopus citations

Abstract

A systolic array provides an alternative computing paradigm to the von Neumann architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Freed from the constraint of a planar layout, we present a three-dimensional virtual systolic array architecture for this algorithm. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems.

Original languageEnglish
Article number1442004
JournalParallel processing letters
Volume24
Issue number4
DOIs
StatePublished - Dec 22 2014

Funding

A systolic array provides an alternative computing paradigm to the von Neumann architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Freed from the constraint of a planar layout, we present a three-dimensional virtual systolic array architecture for this algorithm. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an ∗This work is supported in part by grant #SHF-1117062: “Parallel Unified Linear algebra with Systolic ARrays” (PULSAR) from the National Science Foundation (NSF). Work was funded in part by the Ministry of Education and Science of the Russian Federation, Agreement N 14.607.21.0006 (unique identifier RFMEFI57714X0020). The authors would like to thank the National Institute for Computational Sciences (NICS) for a generous time allocation on the Kraken supercomputer. This work is supported in part by grant #SHF-1117062: “Parallel Unified Linear algebra with Systolic ARrays” (PULSAR) from the National Science Foundation (NSF). Work was funded in part by the Ministry of Education and Science of the Russian Federation,

FundersFunder number
National Science Foundation
Ministry of Education and Science of the Russian FederationRFMEFI57714X0020

    Keywords

    • QR decomposition
    • Systolic array
    • scheduling runtime
    • threading, message-passing, dataflow

    Fingerprint

    Dive into the research topics of 'Design and implementation of a large scale tree-based QR decomposition using a 3D virtual systolic array and a lightweight runtime'. Together they form a unique fingerprint.

    Cite this