Abstract
Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small, independent problems. As the size of each individual problem does not provide sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such an assumption. Consequently, an efficient solution for variable-size batched computations is required. This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and less divergence among threads in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problem is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable-size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. To the authors' knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2.5x against two Sandy Bridge CPUs (8 cores each) running the Intel MKL library.
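To make the idea of a variable-size batch concrete, the following is a minimal sequential sketch in plain Python, not the GPU implementation described in the abstract. It assumes each matrix is symmetric positive definite and stored as a list of rows. The names `cholesky_lower` and `potrf_vbatched` are hypothetical and chosen only for illustration; they are not the paper's interface. The point is the shape of the call: one size per matrix and one status code per matrix, instead of a single size shared by the whole batch.

```python
import math


def cholesky_lower(a, n):
    """Unblocked Cholesky factorization of an n x n matrix stored as a list of rows.

    Overwrites the lower triangle of `a` with L such that A = L * L^T.
    The strict upper triangle is left untouched.
    Returns 0 on success, or j + 1 if the pivot in column j is not positive.
    """
    for j in range(n):
        # Diagonal entry: subtract the squares already computed in row j.
        d = a[j][j] - sum(a[j][k] * a[j][k] for k in range(j))
        if d <= 0.0:
            return j + 1
        a[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(a[i][k] * a[j][k] for k in range(j))
            a[i][j] = (a[i][j] - s) / a[j][j]
    return 0


def potrf_vbatched(matrices, sizes):
    """Hypothetical variable-size batched interface (illustrative only).

    `matrices[i]` is factorized in place using its own size `sizes[i]`.
    Returns a list with one status code per matrix.
    """
    info = [0] * len(matrices)
    for idx, (a, n) in enumerate(zip(matrices, sizes)):
        # Sequential here; on a GPU the matrices would be processed concurrently.
        info[idx] = cholesky_lower(a, n)
    return info


# A batch of three matrices with different sizes (1x1, 2x2, 3x3).
batch = [
    [[4.0]],
    [[4.0, 2.0], [2.0, 5.0]],
    [[9.0, 3.0, 0.0], [3.0, 5.0, 2.0], [0.0, 2.0, 5.0]],
]
sizes = [1, 2, 3]
info = potrf_vbatched(batch, sizes)
```

In this sketch the loop over the batch is trivially independent, which is what makes the problem suitable for concurrent execution. The differing sizes are the source of the irregularity the abstract refers to: work per matrix varies, so a uniform mapping of matrices to threads would leave some threads idle or divergent.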
Original language | English |
---|---|
Title of host publication | Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 1249-1258 |
Number of pages | 10 |
ISBN (Electronic) | 9781509021406 |
State | Published - Jul 18 2016 |
Event | 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 - Chicago, United States. Duration: May 23 2016 → May 27 2016 |
Publication series
Name | Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016 |
---|---|
Conference
Conference | 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 |
---|---|
Country/Territory | United States |
City | Chicago |
Period | 05/23/16 → 05/27/16 |
Funding
This material is based upon work supported by the National Science Foundation under Grant No. CSR 1514286, NVIDIA, the Department of Energy, and in part by the Russian Scientific Foundation, Agreement N14-11-00190.
Keywords
- Batched computation
- GPUs
- Variable small sizes