Abstract
Batch matrix operations address the case of solving the same linear algebra problem for a very large number of very small matrices. In this paper, we focus on implementing the batch Cholesky factorization in CUDA, in single precision arithmetic, for NVIDIA GPUs. Specifically, we look into the benefits of using noncanonical data layouts, where consecutive memory locations store elements with the same row and column index in a set of consecutive matrices. We discuss a number of different implementation options and tuning parameters. We demonstrate superior performance to traditional implementations for the case of very small matrices.
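To make the layout idea concrete, below is a minimal sketch of such a noncanonical (interleaved) layout with a one-thread-per-matrix batched Cholesky kernel. This is an illustration under assumptions, not the paper's implementation: the matrix size `N`, the `idx` helper, the kernel name, and the thread-per-matrix mapping are all hypothetical.

```cuda
// Sketch only: interleaved layout where element (i,j) of consecutive
// matrices occupies consecutive memory locations, as described in the
// abstract. All names and parameters here are illustrative assumptions.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8  /* matrix dimension; the paper targets very small matrices */

/* Element (i,j) of matrix 'mat': the (i,j) entries of neighboring
 * matrices are adjacent in memory, so neighboring threads (each owning
 * one matrix) make coalesced global-memory accesses. */
__device__ __host__ static inline size_t idx(int i, int j, size_t mat, size_t batch)
{
    return ((size_t)i * N + j) * batch + mat;
}

/* Unblocked single-precision Cholesky (lower triangle, in place),
 * one thread factorizing one matrix. */
__global__ void batch_spotrf_interleaved(float *A, size_t batch)
{
    size_t k = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= batch) return;

    for (int j = 0; j < N; j++) {
        /* Diagonal: L(j,j) = sqrt(A(j,j) - sum_p L(j,p)^2). */
        float s = A[idx(j, j, k, batch)];
        for (int p = 0; p < j; p++) {
            float v = A[idx(j, p, k, batch)];
            s -= v * v;
        }
        float d = sqrtf(s);
        A[idx(j, j, k, batch)] = d;
        /* Below the diagonal:
         * L(i,j) = (A(i,j) - sum_p L(i,p) L(j,p)) / L(j,j). */
        for (int i = j + 1; i < N; i++) {
            float t = A[idx(i, j, k, batch)];
            for (int p = 0; p < j; p++)
                t -= A[idx(i, p, k, batch)] * A[idx(j, p, k, batch)];
            A[idx(i, j, k, batch)] = t / d;
        }
    }
}

int main(void)
{
    const size_t batch = 100000;
    const size_t bytes = (size_t)N * N * batch * sizeof(float);

    /* Fill every matrix with the same diagonally dominant SPD test matrix. */
    float *hA = (float *)malloc(bytes);
    for (size_t k = 0; k < batch; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                hA[idx(i, j, k, batch)] = (i == j) ? (float)N : 1.0f;

    float *dA;
    cudaMalloc(&dA, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);

    batch_spotrf_interleaved<<<(unsigned)((batch + 255) / 256), 256>>>(dA, batch);

    cudaMemcpy(hA, dA, bytes, cudaMemcpyDeviceToHost);
    printf("L(0,0) of matrix 0 = %f (expect %f)\n",
           hA[idx(0, 0, 0, batch)], sqrtf((float)N));

    cudaFree(dA);
    free(hA);
    return 0;
}
```

The point of the interleaving is visible in `idx`: with a canonical matrix-contiguous layout, neighboring threads would access addresses `N*N` floats apart and every load would be uncoalesced, whereas here thread `k` and thread `k+1` always touch adjacent words.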
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 1408-1417 |
| Number of pages | 10 |
| ISBN (Electronic) | 9781538634080 |
| DOIs | |
| State | Published - Jun 30 2017 |
| Externally published | Yes |
| Event | 31st IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017 - Orlando, United States |
| Duration | May 29 2017 → Jun 2 2017 |
Publication series
| Name | Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017 |
|---|---|
Conference
| Conference | 31st IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017 |
|---|---|
| Country/Territory | United States |
| City | Orlando |
| Period | 05/29/17 → 06/02/17 |
Funding
Supported by the grant “…Autotuning of Computational Kernels” from the National Science Foundation.
Keywords
- Cholesky factorization
- GPU computing
- batch computation
- data layout
- numerical linear algebra