TY - JOUR
T1 - Factorization and Inversion of a Million Matrices using GPUs
T2 - International Conference on Computational Science ICCS 2017
AU - Abdelfattah, Ahmad
AU - Haidar, Azzam
AU - Tomov, Stanimire
AU - Dongarra, Jack
N1 - Publisher Copyright: © 2017 The Authors. Published by Elsevier B.V.
PY - 2017
Y1 - 2017
AB - This paper presents new algorithmic approaches and optimization techniques for LU factorization and matrix inversion of millions of very small matrices using GPUs. These problems appear in many scientific applications including astrophysics and generation of block-Jacobi preconditioners. We show that, for very small problem sizes, design and optimization of GPU kernels require a mindset different from the one usually used when designing LAPACK algorithms for GPUs. Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. We also take advantage of the small matrix sizes to eliminate the intermediate row interchanges in both the factorization and inversion kernels. The proposed GPU kernels achieve performance speedups vs. CUBLAS of up to 6× for the factorization, and 14× for the inversion, using double precision arithmetic on a Pascal P100 GPU.
UR - http://www.scopus.com/inward/record.url?scp=85027326684&partnerID=8YFLogxK
U2 - 10.1016/j.procs.2017.05.250
DO - 10.1016/j.procs.2017.05.250
M3 - Conference article
AN - SCOPUS:85027326684
SN - 1877-0509
VL - 108
SP - 606
EP - 615
JO - Procedia Computer Science
JF - Procedia Computer Science
Y2 - 12 June 2017 through 14 June 2017
ER -