Abstract
We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different sizes, and for the subsequent triangular solves. All kernels heavily exploit the registers of the graphics processing unit (GPU) to deliver high performance for small problems. The development of these kernels is motivated by the need to tackle this embarrassingly parallel scenario in the context of block-Jacobi preconditioning, which is relevant for the iterative solution of sparse linear systems.
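To make the operation concrete, the following is a minimal CPU reference sketch of what a variable-size batched LU factorization followed by triangular solves computes. It is not the register-based CUDA kernels described in the abstract. The function names (`lu_factor`, `lu_solve`, `batched_lu_solve`), the list-of-lists storage, and the use of partial pivoting are assumptions made for illustration only.

```python
def lu_factor(a):
    """Factor a dense square matrix in place as P*A = L*U (partial pivoting).

    `a` is a list of rows. On return it holds U in its upper triangle and the
    multipliers of the unit lower triangular L below the diagonal.
    Returns the row permutation as a list.
    """
    n = len(a)
    perm = list(range(n))
    for k in range(n):
        # Choose the row with the largest entry in column k as the pivot row.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            a[k], a[p] = a[p], a[k]
            perm[k], perm[p] = perm[p], perm[k]
        # Eliminate column k below the diagonal and update the trailing block.
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return perm


def lu_solve(lu, perm, b):
    """Solve A*x = b from the factors produced by lu_factor."""
    n = len(lu)
    x = [b[perm[i]] for i in range(n)]  # apply the row permutation to b
    # Forward substitution with the unit lower triangular factor L.
    for i in range(n):
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    # Backward substitution with the upper triangular factor U.
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def batched_lu_solve(blocks, rhs):
    """Factor and solve every (block, right-hand side) pair independently.

    The blocks may have different sizes. Each iteration is independent of the
    others, which is what makes the batch embarrassingly parallel.
    """
    solutions = []
    for a, b in zip(blocks, rhs):
        lu = [row[:] for row in a]  # keep the caller's block unchanged
        perm = lu_factor(lu)
        solutions.append(lu_solve(lu, perm, b))
    return solutions


# Three independent blocks of sizes 1, 2 and 3, as in a block-Jacobi setting.
blocks = [
    [[2.0]],
    [[0.0, 1.0], [2.0, 0.0]],
    [[4.0, 3.0, 0.0], [6.0, 3.0, 1.0], [0.0, 2.0, 5.0]],
]
rhs = [[4.0], [3.0, 4.0], [7.0, 10.0, 7.0]]
solutions = batched_lu_solve(blocks, rhs)
```

In this sketch the loop over blocks is sequential. On a GPU, each block would instead be handled by its own group of threads, which is the setting the abstract targets.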
Original language | English |
---|---|
Title of host publication | Proceedings - 46th International Conference on Parallel Processing, ICPP 2017 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 91-100 |
Number of pages | 10 |
ISBN (Electronic) | 9781538610428 |
State | Published - Sep 1 2017 |
Event | 46th International Conference on Parallel Processing, ICPP 2017 - Bristol, United Kingdom; Duration: Aug 14 2017 → Aug 17 2017 |
Publication series
Name | Proceedings of the International Conference on Parallel Processing |
---|---|
ISSN (Print) | 0190-3918 |
Conference
Conference | 46th International Conference on Parallel Processing, ICPP 2017 |
---|---|
Country/Territory | United Kingdom |
City | Bristol |
Period | 08/14/17 → 08/17/17 |
Funding
This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Number DE-SC-0010042. H. Anzt was supported by the “Impuls und Vernetzungs-fond” of the Helmholtz Association. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2014-53495-R of the MINECO, FEDER, and the EU H2020 project
Keywords
- Block-Jacobi
- GPU
- Variable-size batched LU