A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations

Azzam Haidar, Ahmad Abdelfattah, Mawussi Zounon, Stanimire Tomov, Jack Dongarra

Research output: Contribution to journal › Article › peer-review

15 Scopus citations

Abstract

We present high-performance GPU kernels that deliver substantial speedups over vendor libraries for very small matrix computations. We also discuss the main challenges that hinder the design of efficient GPU kernels for small-matrix algorithms, and we propose algorithm analysis for harnessing the full power of a GPU along with strategies for predicting performance before committing to an implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. Taking the Cholesky and LU factorizations as test cases, we show how the proposed methodology enables us to achieve performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels that solve batches of hundreds of thousands of small Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated into our design. The proposed GPU kernels achieve speedups over cuBLAS of up to 6× for the factorizations, using double-precision arithmetic on an NVIDIA Pascal P100 GPU.
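The batched idea the abstract describes is that each small matrix is factored independently, so a GPU can assign one matrix per thread block and keep it entirely in registers or shared memory. The paper's CUDA kernels are not reproduced here; the following is a minimal CPU sketch in Python (NumPy) of the per-matrix unblocked Cholesky algorithm applied across a batch, purely to illustrate the structure of the computation. The function names `cholesky_small` and `cholesky_batched` are illustrative, not from the paper.

```python
import numpy as np

def cholesky_small(A):
    """Unblocked Cholesky (lower-triangular L with A = L L^T) for one
    small SPD matrix. On the GPU, one thread block would run this with
    the whole matrix held in registers/shared memory."""
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        # Diagonal entry: subtract the squared row of already-computed columns.
        L[j, j] = np.sqrt(A[j, j] - np.dot(L[j, :j], L[j, :j]))
        # Column j below the diagonal.
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L

def cholesky_batched(batch):
    """Factor a batch of independent small matrices; each iteration is
    the analogue of one thread block's work in a batched GPU kernel."""
    return np.stack([cholesky_small(A) for A in batch])

# Usage: factor a batch of random 8x8 SPD matrices.
rng = np.random.default_rng(0)
n, b = 8, 16
M = rng.standard_normal((b, n, n))
spd_batch = M @ M.transpose(0, 2, 1) + n * np.eye(n)  # make each matrix SPD
Ls = cholesky_batched(spd_batch)
```

In production one would instead call a batched library routine (e.g., the batched factorizations in MAGMA or cuSOLVER); the point of the sketch is only the one-small-problem-per-work-unit structure that the paper optimizes.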

Original language: English
Pages (from-to): 973-984
Number of pages: 12
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 29
Issue number: 5
DOIs
State: Published - May 1 2018

Funding

This material is based upon work supported by the National Science Foundation under Grants No. CSR 1514286 and No. OAC 1740250, NVIDIA, and the Department of Energy.

Funders and funder numbers:

• National Science Foundation: CSR 1514286, OAC 1740250
• U.S. Department of Energy
• NVIDIA
• Horizon 2020 Framework Programme: 671633

Keywords

• Batched computation
• GPUs
• Variable small sizes
