Progressive Optimization of Batched LU Factorization on GPUs

Ahmad Abdelfattah, Stanimire Tomov, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

This paper presents a progressive approach for optimizing the batched LU factorization on graphics processing units (GPUs). The paper shows that the reliance on level-3 BLAS routines for performance does not really pay off, and that it is indeed important to pay attention to the memory-bound part of the algorithm, especially when the problem size is very small. In this context, we develop a size-aware multi-level blocking technique that utilizes different granularities for kernel fusion according to the problem size. Our experiments, which are conducted on a Tesla V100 GPU, show that the multi-level blocking technique achieves speedups for single/double precisions that are up to 3.28×/2.69× against the generic LAPACK-style implementation. It is also up to 8.72×/7.2× faster than the cuBLAS library for single and double precisions, respectively. The developed solution is integrated into the open-source MAGMA library.
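To make the term "batched LU factorization" concrete, the following is a minimal CPU reference sketch in pure Python. It is not the paper's GPU implementation and it is not MAGMA or cuBLAS code. It only illustrates the underlying operation: many small, independent matrices are each factored in place with partial pivoting. The function names `lu_factor` and `batched_lu` are hypothetical and chosen for illustration.

```python
def lu_factor(a):
    """Factor one square matrix in place (list of row lists) as P*A = L*U.

    After the call, the strict lower triangle of `a` holds the multipliers
    of L (unit diagonal implied) and the upper triangle holds U.
    Returns the list of pivot row indices chosen at each elimination step.
    """
    n = len(a)
    pivots = []
    for k in range(n):
        # Partial pivoting: pick the row with the largest magnitude in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        pivots.append(p)
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]            # multiplier stored in place of the zero
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]  # update the trailing submatrix
    return pivots


def batched_lu(batch):
    """Factor every matrix in `batch` independently; returns one pivot list each."""
    return [lu_factor(a) for a in batch]


if __name__ == "__main__":
    batch = [
        [[4.0, 3.0], [6.0, 3.0]],
        [[1.0, 2.0], [3.0, 4.0]],
    ]
    print(batched_lu(batch))
    print(batch)
```

On a GPU, the loop over `batch` is what gets parallelized. The scaling and trailing-update loops for very small matrices move little data per matrix, which is the memory-bound regime the abstract refers to.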

Original language: English
Title of host publication: 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728150208
State: Published - Sep 2019
Externally published: Yes
Event: 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019 - Waltham, United States
Duration: Sep 24 2019 to Sep 26 2019

Publication series

Name: 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Conference

Conference: 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019
Country/Territory: United States
City: Waltham
Period: 09/24/19 to 09/26/19

Funding

This work is partially supported by NSF grants No. OAC-1740250 and CSR 1514286, NVIDIA, and by the Exascale Computing Project (17-SC-20-SC).

Funder: Funder number
National Science Foundation: CSR 1514286, OAC-1740250
NVIDIA
Exascale Computing Project: 17-SC-20-SC

    Keywords

    • Batch computation
    • GPU computing
    • LU factorization
