TY - GEN
T1 - Towards Half-Precision Computation for Complex Matrices
T2 - 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2019
AU - Abdelfattah, Ahmad
AU - Tomov, Stanimire
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - The use of low-precision computations is popular in accelerating machine learning and artificial intelligence (AI) applications. Hardware architectures, such as high-end graphics processing units (GPUs), now support native 16-bit floating-point arithmetic (i.e., half-precision). While half precision provides a natural 2×/4× speedup against the performance of single/double precisions, respectively, modern GPUs are equipped with hardware accelerators that further boost the FP16 performance. These accelerators, known as tensor cores (TCs), have a theoretical peak performance that is 8×/16× faster than FP32/FP64 performance, respectively. Such a high level of performance has encouraged researchers to harness the compute power of TCs outside AI applications. This paper presents a mixed-precision dense linear solver (Ax = b) for complex matrices using the GPU's TC units. Unlike similar efforts that have discussed accelerating Ax = b in real FP16 arithmetic, this paper focuses on complex FP16 precisions. The developed solution uses a 'half-complex' precision to accelerate the solution of Ax = b while maintaining complex FP32 precision accuracy. The proposed solver requires the development of a high-performance mixed-precision matrix multiplication (CGEMM-FP16) that accepts half-complex inputs, and uses the TCs' full-precision products and FP32 accumulations for the computation. We discuss two designs and their performance. Similar to the way fast GEMMs power the performance of LAPACK, the mixed-precision CGEMM-FP16 can enable the development of mixed-precision LAPACK algorithms. We illustrate this by integrating both CGEMM-FP16s into the development of mixed-precision LU factorizations of complex matrices. Finally, an iterative refinement solver is used to deliver complex FP32 accuracy using a preconditioned GMRES solver. Our experiments, conducted on V100 GPUs, show that the mixed-precision solver can be up to 2.5× faster than a full single-complex precision solver.
AB - The use of low-precision computations is popular in accelerating machine learning and artificial intelligence (AI) applications. Hardware architectures, such as high-end graphics processing units (GPUs), now support native 16-bit floating-point arithmetic (i.e., half-precision). While half precision provides a natural 2×/4× speedup against the performance of single/double precisions, respectively, modern GPUs are equipped with hardware accelerators that further boost the FP16 performance. These accelerators, known as tensor cores (TCs), have a theoretical peak performance that is 8×/16× faster than FP32/FP64 performance, respectively. Such a high level of performance has encouraged researchers to harness the compute power of TCs outside AI applications. This paper presents a mixed-precision dense linear solver (Ax = b) for complex matrices using the GPU's TC units. Unlike similar efforts that have discussed accelerating Ax = b in real FP16 arithmetic, this paper focuses on complex FP16 precisions. The developed solution uses a 'half-complex' precision to accelerate the solution of Ax = b while maintaining complex FP32 precision accuracy. The proposed solver requires the development of a high-performance mixed-precision matrix multiplication (CGEMM-FP16) that accepts half-complex inputs, and uses the TCs' full-precision products and FP32 accumulations for the computation. We discuss two designs and their performance. Similar to the way fast GEMMs power the performance of LAPACK, the mixed-precision CGEMM-FP16 can enable the development of mixed-precision LAPACK algorithms. We illustrate this by integrating both CGEMM-FP16s into the development of mixed-precision LU factorizations of complex matrices. Finally, an iterative refinement solver is used to deliver complex FP32 accuracy using a preconditioned GMRES solver. Our experiments, conducted on V100 GPUs, show that the mixed-precision solver can be up to 2.5× faster than a full single-complex precision solver.
KW - Half precision
KW - Tensor cores
KW - FP16 arithmetic
KW - mixed-precision solvers
UR - http://www.scopus.com/inward/record.url?scp=85078704469&partnerID=8YFLogxK
U2 - 10.1109/ScalA49573.2019.00008
DO - 10.1109/ScalA49573.2019.00008
M3 - Conference contribution
AN - SCOPUS:85078704469
T3 - Proceedings of ScalA 2019: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 17
EP - 24
BT - Proceedings of ScalA 2019
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 18 November 2019
ER -