Abstract
This paper introduces several frameworks for the design and implementation of high-performance GPU kernels that target batched workloads with irregular matrix sizes. Such workloads are ubiquitous in many scientific applications, including sparse direct solvers, astrophysics, and quantum chemistry. The paper addresses two main categories of frameworks, taking the Cholesky factorization as a case study. The first uses host-side kernel launches; the second uses device-side launches. Within each category, different design options are introduced, with an emphasis on the advantages and disadvantages of each approach. Our best-performing design outperforms the state-of-the-art CPU implementation, achieving up to a 4.7× speedup in double precision on a Pascal P100 GPU.
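For context, the core operation that each batched kernel applies to every matrix in the batch is the Cholesky factorization A = L·Lᵀ of a symmetric positive-definite matrix. The pure-Python sketch below is illustrative only (the paper's kernels are GPU implementations, and the function names here are hypothetical): it shows the per-matrix factorization and a serial stand-in for the batched routine over irregular sizes.

```python
import math

def cholesky(a):
    """Unblocked Cholesky factorization.

    `a` is a symmetric positive-definite matrix given as a list of rows.
    Returns the lower-triangular factor L with A = L * L^T.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: sqrt of the remaining pivot.
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(s)
        # Column update below the diagonal.
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l

def batched_cholesky(batch):
    """Factorize every matrix in a batch of irregular sizes, independently.

    A GPU framework would instead map these independent problems onto
    kernel launches (from the host or the device); this loop is only a
    serial reference for what each launch computes.
    """
    return [cholesky(a) for a in batch]
```

For example, `batched_cholesky([[[4.0, 2.0], [2.0, 3.0]], [[9.0]]])` factorizes a 2×2 and a 1×1 matrix in one call, mirroring the irregular-size batches the paper targets.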
Original language | English |
---|---|
Title of host publication | 2018 IEEE High Performance Extreme Computing Conference, HPEC 2018 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
ISBN (Electronic) | 9781538659892 |
DOIs | |
State | Published - Nov 26 2018 |
Event | 2018 IEEE High Performance Extreme Computing Conference, HPEC 2018 - Waltham, United States; Duration: Sep 25 2018 → Sep 27 2018 |
Publication series
Name | 2018 IEEE High Performance Extreme Computing Conference, HPEC 2018 |
---|---|
Conference
Conference | 2018 IEEE High Performance Extreme Computing Conference, HPEC 2018 |
---|---|
Country/Territory | United States |
City | Waltham |
Period | 09/25/18 → 09/27/18 |
Funding
This work is partially supported by NSF grant No. OAC-1740250 and CSR 1514286, NVIDIA, and by the Exascale Computing Project (17-SC-20-SC).
Keywords
- Batch Linear Algebra
- GPU Computing
- Matrix Factorization