Abstract
The use of low-precision computations is popular in accelerating machine learning and artificial intelligence (AI) applications. Hardware architectures, such as high-end graphics processing units (GPUs), now support native 16-bit floating-point arithmetic (i.e., half precision). While half precision provides a natural 2×/4× speedup over single/double precision, respectively, modern GPUs are equipped with hardware accelerators that further boost FP16 performance. These accelerators, known as tensor cores (TCs), have a theoretical peak performance that is 8×/16× faster than FP32/FP64 performance, respectively. Such a high level of performance has encouraged researchers to harness the compute power of TCs outside AI applications. This paper presents a mixed-precision dense linear solver (Ax = b) for complex matrices using the GPU's TC units. Unlike similar efforts that have accelerated Ax = b in real FP16 arithmetic, this paper focuses on complex FP16 arithmetic. The developed solution uses a 'half-complex' precision to accelerate the solution of Ax = b while maintaining complex FP32 accuracy. The proposed solver requires the development of a high-performance mixed-precision matrix multiplication (CGEMM-FP16) that accepts half-complex inputs and uses the TCs' full-precision products and FP32 accumulations for the computation. We discuss two designs and their performance. Similar to the way fast GEMMs power the performance of LAPACK, the mixed-precision CGEMM-FP16 can enable the development of mixed-precision LAPACK algorithms. We illustrate this by integrating both CGEMM-FP16 designs into mixed-precision LU factorizations of complex matrices. Finally, iterative refinement based on a preconditioned GMRES solver delivers complex FP32 accuracy. Our experiments, conducted on V100 GPUs, show that the mixed-precision solver can be up to 2.5× faster than a full single-complex precision solver.
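The abstract does not detail the two CGEMM-FP16 designs, but the general recipe it names (half-complex inputs, full-precision tensor-core products, FP32 accumulation) can be sketched with a standard "4M" decomposition, in which one complex GEMM, with real part Cr = Ar·Br - Ai·Bi and imaginary part Ci = Ar·Bi + Ai·Br, is assembled from four real tensor-core GEMMs. The sketch below uses cuBLAS's `cublasGemmEx` (assuming the CUDA 11+ compute-type API); the planar real/imaginary storage, the helper name `cgemm_fp16_4m`, and the 4M split itself are illustrative assumptions, not necessarily the authors' kernel design.

```c
#include <cublas_v2.h>
#include <cuda_fp16.h>

/* Illustrative sketch: complex GEMM C = A*B assembled from four real GEMMs.
 * A (m x k) and B (k x n) are stored planar: separate FP16 real and
 * imaginary planes (Ar/Ai, Br/Bi). C (m x n) accumulates in FP32 planes
 * (Cr/Ci). All matrices are column-major with leading dimension = rows.
 * FP16 inputs with an FP32 compute type engage the tensor cores'
 * full-precision products and FP32 accumulation on Volta and newer. */
static void cgemm_fp16_4m(cublasHandle_t handle, int m, int n, int k,
                          const __half *Ar, const __half *Ai,
                          const __half *Br, const __half *Bi,
                          float *Cr, float *Ci)
{
    const float one = 1.0f, minus_one = -1.0f, zero = 0.0f;

    /* Real part: Cr = Ar*Br - Ai*Bi (two tensor-core GEMMs) */
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &one,       Ar, CUDA_R_16F, m, Br, CUDA_R_16F, k,
                 &zero,      Cr, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &minus_one, Ai, CUDA_R_16F, m, Bi, CUDA_R_16F, k,
                 &one,       Cr, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    /* Imaginary part: Ci = Ar*Bi + Ai*Br */
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &one,       Ar, CUDA_R_16F, m, Bi, CUDA_R_16F, k,
                 &zero,      Ci, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &one,       Ai, CUDA_R_16F, m, Br, CUDA_R_16F, k,
                 &one,       Ci, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```

In the full solver described by the abstract, a GEMM of this kind drives the mixed-precision LU factorization of the complex matrix, and GMRES-based iterative refinement, preconditioned with the low-precision factors, then recovers complex FP32 accuracy.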
| Original language | English |
|---|---|
| Title of host publication | Proceedings of ScalA 2019 |
| Subtitle of host publication | 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 17-24 |
| Number of pages | 8 |
| ISBN (Electronic) | 9781728159898 |
| DOIs | |
| State | Published - Nov 2019 |
| Externally published | Yes |
| Event | 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2019 - Denver, United States. Duration: Nov 18 2019 → … |
Publication series

| Name | Proceedings of ScalA 2019: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis |
|---|---|
Conference

| Conference | 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2019 |
|---|---|
| Country/Territory | United States |
| City | Denver |
| Period | 11/18/19 → … |
Funding
ACKNOWLEDGMENT: This work is partially supported by NSF grants No. OAC-1740250 and CSR-1514286, by NVIDIA, and by the Exascale Computing Project (17-SC-20-SC).
Keywords
- Half precision
- Tensor cores
- FP16 arithmetic
- mixed-precision solvers