TY - GEN
T1 - Optimization for performance and energy for batched matrix computations on GPUs
AU - Haidar, Azzam
AU - Dong, Tingxing
AU - Luszczek, Piotr
AU - Tomov, Stanimire
AU - Dongarra, Jack
N1 - Publisher Copyright:
Copyright 2015 ACM.
PY - 2015/2/7
Y1 - 2015/2/7
AB - As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient and high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work in parallel on a set of small dense matrices, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach represents the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in hybrid CPU-GPU algorithms, is to benefit exclusively from the GPU's significantly higher energy efficiency, as well as from the removal of costly CPU-to-GPU communications. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorizations to achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to the batched LU factorization featured in the CUBLAS library for GPUs, we achieved up to a 2.5× speedup on the K40 GPU.
KW - Batched factorization
KW - Hardware accelerators
KW - Numerical linear algebra
KW - Numerical software libraries
KW - One-sided factorization algorithms
UR - http://www.scopus.com/inward/record.url?scp=84938873257&partnerID=8YFLogxK
U2 - 10.1145/2716282.2716288
DO - 10.1145/2716282.2716288
M3 - Conference contribution
AN - SCOPUS:84938873257
T3 - ACM International Conference Proceeding Series
SP - 59
EP - 69
BT - ACM International Conference Proceeding Series
A2 - Gong, Xiang
PB - Association for Computing Machinery
T2 - 8th Annual Workshop on General Purpose Processing using Graphics Processing Unit, GPGPU 2015
Y2 - 7 February 2015
ER -