TY - GEN
T1 - Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators
AU - Sukkari, Dalal
AU - Gates, Mark
AU - Al Farhan, Mohammed
AU - Anzt, Hartwig
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/11/12
Y1 - 2023/11/12
N2 - We investigate a new task-based implementation of the polar decomposition on massively parallel systems augmented with multiple GPUs using SLATE. We implement the iterative QR Dynamically-Weighted Halley (QDWH) algorithm, whose building blocks consist mainly of compute-bound matrix operations, allowing high levels of parallelism to be exploited on various hardware architectures, such as NVIDIA, AMD, and Intel GPU-based systems. To achieve both performance and portability, we implement our QDWH-based polar decomposition in the SLATE library, which employs efficient dense linear algebra techniques, such as 2D block-cyclic data distribution and communication-avoiding algorithms, together with modern parallel programming approaches, such as dynamic scheduling and communication overlapping, and uses OpenMP tasks to track data dependencies. We report numerical accuracy and performance results. The benchmarking campaign reveals up to an 18-fold performance speedup of the GPU-accelerated implementation compared to the existing state-of-the-art implementation of the polar decomposition.
AB - We investigate a new task-based implementation of the polar decomposition on massively parallel systems augmented with multiple GPUs using SLATE. We implement the iterative QR Dynamically-Weighted Halley (QDWH) algorithm, whose building blocks consist mainly of compute-bound matrix operations, allowing high levels of parallelism to be exploited on various hardware architectures, such as NVIDIA, AMD, and Intel GPU-based systems. To achieve both performance and portability, we implement our QDWH-based polar decomposition in the SLATE library, which employs efficient dense linear algebra techniques, such as 2D block-cyclic data distribution and communication-avoiding algorithms, together with modern parallel programming approaches, such as dynamic scheduling and communication overlapping, and uses OpenMP tasks to track data dependencies. We report numerical accuracy and performance results. The benchmarking campaign reveals up to an 18-fold performance speedup of the GPU-accelerated implementation compared to the existing state-of-the-art implementation of the polar decomposition.
KW - Linear algebra
KW - QDWH
KW - polar decomposition
UR - http://www.scopus.com/inward/record.url?scp=85178087515&partnerID=8YFLogxK
U2 - 10.1145/3624062.3624248
DO - 10.1145/3624062.3624248
M3 - Conference contribution
AN - SCOPUS:85178087515
T3 - ACM International Conference Proceeding Series
SP - 1680
EP - 1687
BT - Proceedings of 2023 SC Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023
PB - Association for Computing Machinery
T2 - 2023 International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023
Y2 - 12 November 2023 through 17 November 2023
ER -