Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs

Ahmad Abdelfattah, Stanimire Tomov, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

11 Scopus citations

Abstract

The use of low-precision computations is popular in accelerating machine learning and artificial intelligence (AI) applications. Hardware architectures, such as high-end graphics processing units (GPUs), now support native 16-bit floating-point arithmetic (i.e., half precision). While half precision provides a natural 2×/4× speedup over single/double precision, respectively, modern GPUs are also equipped with hardware accelerators that further boost FP16 performance. These accelerators, known as tensor cores (TCs), have a theoretical peak performance that is 8×/16× faster than FP32/FP64 performance, respectively. Such a high level of performance has encouraged researchers to harness the compute power of TCs outside AI applications. This paper presents a mixed-precision dense linear solver (Ax = b) for complex matrices using the GPU's TC units. Unlike similar efforts that have discussed accelerating Ax = b in real FP16 arithmetic, this paper focuses on complex FP16 precision. The developed solution uses a 'half-complex' precision to accelerate the solution of Ax = b while maintaining complex FP32 accuracy. The proposed solver requires the development of a high-performance mixed-precision matrix multiplication (CGEMM-FP16) that accepts half-complex inputs and uses the TCs' full-precision products and FP32 accumulations for the computation. We discuss two designs and their performance. Similar to the way fast GEMMs power the performance of LAPACK, the mixed-precision CGEMM-FP16 can enable the development of mixed-precision LAPACK algorithms. We illustrate this by integrating both CGEMM-FP16 designs into mixed-precision LU factorizations of complex matrices. Finally, an iterative refinement solver, based on preconditioned GMRES, is used to deliver complex FP32 accuracy. Our experiments, conducted on V100 GPUs, show that the mixed-precision solver can be up to 2.5× faster than a full single-complex precision solver.
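
To make the core idea behind CGEMM-FP16 concrete, the following sketch composes a half-complex matrix product from four real FP16 GEMMs with FP32 accumulation on the tensor cores. This is a minimal illustration built on cuBLAS, not the paper's kernels: the planar (split real/imaginary) storage, the function name cgemm_fp16_planar, and the assumption of cuBLAS 11+ are choices made here for the example; the two designs evaluated in the paper are the authors' own and may organize the computation differently.

// Illustrative sketch (not the paper's CGEMM-FP16 kernels): a half-complex GEMM
// built from four real FP16 GEMMs with FP32 accumulation,
//   Cr = Ar*Br - Ai*Bi,   Ci = Ar*Bi + Ai*Br,
// where each real GEMM runs on the tensor cores via cublasGemmEx (FP16 inputs,
// FP32 output/compute). Assumes cuBLAS 11+; error checking omitted for brevity.
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Ar/Ai, Br/Bi: device pointers to FP16 real/imaginary planes (column-major).
// Cr/Ci: device pointers to FP32 planes receiving the real/imaginary result.
void cgemm_fp16_planar(cublasHandle_t handle, int m, int n, int k,
                       const __half *Ar, const __half *Ai, int lda,
                       const __half *Br, const __half *Bi, int ldb,
                       float *Cr, float *Ci, int ldc)
{
    const float one = 1.0f, neg_one = -1.0f, zero = 0.0f;

    // Cr = Ar * Br
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &one,  Ar, CUDA_R_16F, lda, Br, CUDA_R_16F, ldb,
                 &zero, Cr, CUDA_R_32F, ldc,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    // Cr = Cr - Ai * Bi
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &neg_one, Ai, CUDA_R_16F, lda, Bi, CUDA_R_16F, ldb,
                 &one,     Cr, CUDA_R_32F, ldc,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    // Ci = Ar * Bi
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &one,  Ar, CUDA_R_16F, lda, Bi, CUDA_R_16F, ldb,
                 &zero, Ci, CUDA_R_32F, ldc,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    // Ci = Ci + Ai * Br
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &one, Ai, CUDA_R_16F, lda, Br, CUDA_R_16F, ldb,
                 &one, Ci, CUDA_R_32F, ldc,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}

In the solver described above, a GEMM of this kind drives the half-complex LU factorization, and GMRES-based iterative refinement preconditioned by that low-precision factorization recovers complex FP32 accuracy in the final solution.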

Original language: English
Title of host publication: Proceedings of ScalA 2019
Subtitle of host publication: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 17-24
Number of pages: 8
ISBN (Electronic): 9781728159898
State: Published - Nov 2019
Externally published: Yes
Event: 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2019 - Denver, United States
Duration: Nov 18 2019 → …

Publication series

Name: Proceedings of ScalA 2019: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2019
Country/Territory: United States
City: Denver
Period: 11/18/19 → …

Funding

ACKNOWLEDGMENT: This work is partially supported by NSF grants No. OAC-1740250 and CSR 1514286, by NVIDIA, and by the Exascale Computing Project (17-SC-20-SC).

Funders and funder numbers:
• National Science Foundation: CSR 1514286, 1740250, OAC-1740250
• NVIDIA
• Exascale Computing Project: 17-SC-20-SC

Keywords

• Half precision
• Tensor cores
• FP16 arithmetic
• Mixed-precision solvers
