TY - GEN
T1 - Towards batched linear solvers on accelerated hardware platforms
AU - Haidar, Azzam
AU - Dong, Tingxing
AU - Luszczek, Piotr
AU - Tomov, Stanimire
AU - Dongarra, Jack
PY - 2015/1/24
Y1 - 2015/1/24
AB - As hardware evolves, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially on GPUs, which are currently about four to five times more energy efficient than multicore CPUs per floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) needed to factorize a set of small dense matrices in parallel. We refer to such algorithms as batched factorizations. Our approach represents the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and hybrid MAGMA algorithms for large-matrix factorizations, but it differs from the straightforward approach in which each of the GPU's streaming multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development of batched factorizations that achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to the batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to a 2.5-fold speedup on the K40 GPU.
KW - Batched factorization
KW - Hardware accelerators
KW - Numerical linear algebra
KW - Numerical software libraries
KW - One-sided factorization algorithms
UR - http://www.scopus.com/inward/record.url?scp=84939207423&partnerID=8YFLogxK
U2 - 10.1145/2688500.2688534
DO - 10.1145/2688500.2688534
M3 - Conference contribution
AN - SCOPUS:84939207423
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 261
EP - 262
BT - 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015 - Proceedings
PB - Association for Computing Machinery
T2 - 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015
Y2 - 7 February 2015 through 11 February 2015
ER -