TY - GEN
T1 - LU factorization of small matrices
T2 - 16th IEEE International Conference on High Performance Computing and Communications, HPCC 2014, 11th IEEE International Conference on Embedded Software and Systems, ICESS 2014 and 6th International Symposium on Cyberspace Safety and Security, CSS 2014
AU - Dong, Tingxing
AU - Haidar, Azzam
AU - Luszczek, Piotr
AU - Harris, James Austin
AU - Tomov, Stanimire
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014/3/9
Y1 - 2014/3/9
N2 - Gaussian Elimination is commonly used to solve dense linear systems in scientific models. In a large number of applications, a need arises to solve many small-size problems instead of a few large linear systems. The size of each of these small linear systems depends on the number of ordinary differential equations (ODEs) used in the model and can be on the order of hundreds of unknowns. To efficiently exploit the computing power of modern accelerator hardware, these linear systems are processed in batches. To improve numerical stability, at least partial pivoting is required, most often accomplished with row pivoting. However, row pivoting can result in a severe performance penalty on GPUs because it introduces thread divergence and non-coalesced memory accesses. In this paper, we propose a batched LU factorization for GPUs using a multi-level blocked right-looking algorithm that preserves the data layout but minimizes the penalty of partial pivoting. Our batched LU achieves up to a 2.5-fold speedup compared with the alternative CUBLAS solution on a K40c GPU.
AB - Gaussian Elimination is commonly used to solve dense linear systems in scientific models. In a large number of applications, a need arises to solve many small-size problems instead of a few large linear systems. The size of each of these small linear systems depends on the number of ordinary differential equations (ODEs) used in the model and can be on the order of hundreds of unknowns. To efficiently exploit the computing power of modern accelerator hardware, these linear systems are processed in batches. To improve numerical stability, at least partial pivoting is required, most often accomplished with row pivoting. However, row pivoting can result in a severe performance penalty on GPUs because it introduces thread divergence and non-coalesced memory accesses. In this paper, we propose a batched LU factorization for GPUs using a multi-level blocked right-looking algorithm that preserves the data layout but minimizes the penalty of partial pivoting. Our batched LU achieves up to a 2.5-fold speedup compared with the alternative CUBLAS solution on a K40c GPU.
KW - GPU
KW - Gaussian Elimination
KW - batched
UR - http://www.scopus.com/inward/record.url?scp=84983164717&partnerID=8YFLogxK
U2 - 10.1109/HPCC.2014.30
DO - 10.1109/HPCC.2014.30
M3 - Conference contribution
AN - SCOPUS:84983164717
T3 - Proceedings - 16th IEEE International Conference on High Performance Computing and Communications, HPCC 2014, 11th IEEE International Conference on Embedded Software and Systems, ICESS 2014 and 6th International Symposium on Cyberspace Safety and Security, CSS 2014
SP - 157
EP - 160
BT - Proceedings - 16th IEEE International Conference on High Performance Computing and Communications, HPCC 2014, 11th IEEE International Conference on Embedded Software and Systems, ICESS 2014 and 6th International Symposium on Cyberspace Safety and Security, CSS 2014
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 20 August 2014 through 22 August 2014
ER -