Abstract
This paper presents a progressive approach for optimizing the batched LU factorization on graphics processing units (GPUs). The paper shows that relying on level-3 BLAS routines for performance does not really pay off, and that it is important to pay attention to the memory-bound part of the algorithm, especially when the problem size is very small. In this context, we develop a size-aware multi-level blocking technique that uses different granularities for kernel fusion according to the problem size. Our experiments, conducted on a Tesla V100 GPU, show that the multi-level blocking technique achieves speedups of up to 3.28× in single precision and 2.69× in double precision over the generic LAPACK-style implementation. It is also up to 8.72× and 7.2× faster than the cuBLAS library in single and double precision, respectively. The developed solution is integrated into the open-source MAGMA library.
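For orientation, a batched LU factorization applies the same factorization independently to many small matrices. The following is a minimal CPU reference sketch in plain Python of that operation only. It is not the paper's GPU kernels, its multi-level blocking scheme, or the MAGMA API; the names `lu_factor` and `batched_lu` are hypothetical and chosen for illustration.

```python
def lu_factor(a):
    """Unblocked LU with partial pivoting, in place, on one n x n matrix.

    `a` is a list of n rows (lists of floats). On return, the strict lower
    triangle holds L (unit diagonal implied) and the upper triangle holds U.
    Returns `piv`, where row i of the factored matrix came from row piv[i]
    of the original matrix.
    """
    n = len(a)
    piv = list(range(n))
    for k in range(n):
        # Choose the row with the largest magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            a[k], a[p] = a[p], a[k]
            piv[k], piv[p] = piv[p], piv[k]
        # Scale the column below the pivot, then update the trailing block.
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return piv


def batched_lu(batch):
    """Factor every matrix in `batch` independently; return one pivot list each."""
    return [lu_factor(a) for a in batch]
```

Each matrix in the batch is independent of the others, which is why the operation maps naturally onto a GPU. For very small matrices the column scaling and row swaps account for a large share of the work, which is the memory-bound part the abstract refers to.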
| Original language | English |
|---|---|
| Title of host publication | 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| ISBN (Electronic) | 9781728150208 |
| DOIs | |
| State | Published - Sep 2019 |
| Externally published | Yes |
| Event | 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019 - Waltham, United States. Duration: Sep 24 2019 → Sep 26 2019 |
Publication series
| Name | 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019 |
|---|---|
Conference
| Conference | 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019 |
|---|---|
| Country/Territory | United States |
| City | Waltham |
| Period | 09/24/19 → 09/26/19 |
Funding
This work is partially supported by NSF grant No. OAC-1740250 and CSR 1514286, NVIDIA, and by the Exascale Computing Project (17-SC-20-SC).
Keywords
- Batch computation
- GPU computing
- LU factorization