Autotuning Numerical Dense Linear Algebra for Batched Computation with GPU Hardware Accelerators

Jack Dongarra, Mark Gates, Jakub Kurzak, Piotr Luszczek, Yaohung M. Tsai

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

Computational problems in engineering and scientific disciplines often rely on solving many instances of small systems of linear equations at once, a workload known as batched solves. In this paper, we focus on batched Cholesky factorization and the subsequent substitutions. The factorization requires the linear system matrices to be symmetric positive definite (SPD). We describe the implementation and automated performance engineering of the kernels that perform the factorization and the two triangular substitutions. Our target platforms are graphics processing units (GPUs), which over the past decade have become an attractive high-performance computing (HPC) target for linear system solvers. Due to their throughput-oriented design, GPUs exhibit the highest processing rates among available processors; however, without careful design and coding, this speed is mostly restricted to large matrix sizes. We show an automated exploration of the implementation space as well as a new data layout for the batched class of SPD solvers. Our tests involve the solution of many thousands of SPD linear systems of exactly the same size, with the individual matrices in the batch ranging in dimension from 5-by-5 up to 100-by-100. We compare our autotuned solvers against state-of-the-art alternatives, including those distributed by NVIDIA and those publicly available in the optimized MAGMA library. The observed performance is competitive, and in many practical cases superior. The advantage of the presented methodology lies in achieving these results in a portable manner across matrix storage formats and GPU hardware architectures.
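
To make the batched-solve workload concrete, the following minimal sketch solves a batch of small SPD systems with cuSOLVER's batched Cholesky routines (cusolverDnSpotrfBatched and cusolverDnSpotrsBatched), i.e., the kind of NVIDIA-provided baseline the abstract compares against. It is an illustration, not the paper's autotuned solver; the matrix size N, the batch count, and the diagonally dominant test matrices are assumptions made for this example, and error checking is omitted for brevity.

    /* Illustrative sketch: solve a batch of small SPD systems A_i x_i = b_i
     * with cuSOLVER's batched Cholesky routines. This is the kind of vendor
     * baseline the paper compares against, NOT the paper's autotuned solver.
     * N, BATCH, and the test matrices are assumptions made for the example. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cusolverDn.h>

    #define N     10     /* each SPD matrix is N x N (paper: 5..100) */
    #define BATCH 1000   /* number of systems solved at once         */

    int main(void)
    {
        cusolverDnHandle_t handle;
        cusolverDnCreate(&handle);

        /* Host: BATCH copies of a diagonally dominant (hence SPD) matrix,
         * stored column-major, and right-hand sides of all ones. */
        float *hA = (float*)malloc(sizeof(float) * N * N * BATCH);
        float *hB = (float*)malloc(sizeof(float) * N * BATCH);
        for (int k = 0; k < BATCH; ++k)
            for (int j = 0; j < N; ++j) {
                for (int i = 0; i < N; ++i)
                    hA[(size_t)k*N*N + j*N + i] = (i == j) ? N + 1.0f : 1.0f;
                hB[k*N + j] = 1.0f;
            }

        /* Device: contiguous storage plus arrays of per-matrix pointers,
         * as the batched cuSOLVER interface expects. */
        float *dA, *dB, **dAarr, **dBarr;
        int *dinfo;
        cudaMalloc((void**)&dA, sizeof(float) * N * N * BATCH);
        cudaMalloc((void**)&dB, sizeof(float) * N * BATCH);
        cudaMalloc((void**)&dinfo, sizeof(int) * BATCH);
        cudaMemcpy(dA, hA, sizeof(float) * N * N * BATCH, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, sizeof(float) * N * BATCH, cudaMemcpyHostToDevice);

        float **hAarr = (float**)malloc(sizeof(float*) * BATCH);
        float **hBarr = (float**)malloc(sizeof(float*) * BATCH);
        for (int k = 0; k < BATCH; ++k) {
            hAarr[k] = dA + (size_t)k * N * N;
            hBarr[k] = dB + (size_t)k * N;
        }
        cudaMalloc((void**)&dAarr, sizeof(float*) * BATCH);
        cudaMalloc((void**)&dBarr, sizeof(float*) * BATCH);
        cudaMemcpy(dAarr, hAarr, sizeof(float*) * BATCH, cudaMemcpyHostToDevice);
        cudaMemcpy(dBarr, hBarr, sizeof(float*) * BATCH, cudaMemcpyHostToDevice);

        /* Batched factorization: A_i = L_i * L_i^T for all i. */
        cusolverDnSpotrfBatched(handle, CUBLAS_FILL_MODE_LOWER, N,
                                dAarr, N, dinfo, BATCH);
        /* Batched forward and backward substitution (nrhs must be 1 here). */
        cusolverDnSpotrsBatched(handle, CUBLAS_FILL_MODE_LOWER, N, 1,
                                dAarr, N, dBarr, N, dinfo, BATCH);

        cudaMemcpy(hB, dB, sizeof(float) * N * BATCH, cudaMemcpyDeviceToHost);
        printf("x_0[0] = %f (expected %f)\n", hB[0], 1.0f / (2.0f * N));

        cudaFree(dA); cudaFree(dB); cudaFree(dAarr); cudaFree(dBarr);
        cudaFree(dinfo);
        free(hA); free(hB); free(hAarr); free(hBarr);
        cusolverDnDestroy(handle);
        return 0;
    }

Note that cusolverDnSpotrsBatched supports only one right-hand side per system. The paper's own kernels are instead generated and autotuned per matrix size and use the new data layout the abstract mentions, rather than the pointer-array layout shown here.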

Original language: English
Article number: 8476161
Pages (from-to): 2040-2055
Number of pages: 16
Journal: Proceedings of the IEEE
Volume: 106
Issue number: 11
DOIs:
State: Published - Nov 2018
Externally published: Yes

Funding

Manuscript received July 3, 2017; revised May 31, 2018; accepted August 27, 2018. Date of publication September 28, 2018; date of current version October 25, 2018.

This work was supported by the National Science Foundation under Grant #1642441 (SI2-SSE: BONSAI: An Open Software Infrastructure for Parallel Autotuning of Computational Kernels), by the Department of Energy under Grant #DE-SC0010042, and by NVIDIA Corporation. (Corresponding author: Piotr Luszczek.)

J. Dongarra is with the Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, TN 37996 USA, with Oak Ridge National Laboratory, Oak Ridge, TN 37831 USA, and with the University of Manchester, Manchester M13 9PL, U.K. M. Gates, J. Kurzak, P. Luszczek, and Y. M. Tsai are with the Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, TN 37996 USA (e-mail: [email protected]).

Funders                        Funder number
National Science Foundation    1642441 (SI2-SSE)
U.S. Department of Energy      DE-SC0010042
NVIDIA

Keywords

• Dense numerical linear algebra
• Performance autotuning
