TY - GEN
T1 - Harnessing GPU Tensor Cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers
AU - Haidar, Azzam
AU - Tomov, Stanimire
AU - Dongarra, Jack
AU - Higham, Nicholas J.
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we use the general HPC problem, Ax = b, where A is a large dense matrix, and a double precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16-FP64) iterative refinement, and we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations. These new methods show how using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to 4× speedup. This is due to the performance boost that the FP16-TC provide as well as to the improved accuracy over the classical FP16 arithmetic that is obtained because the GEMM accumulation occurs in FP32 arithmetic.
AB - Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we use the general HPC problem, Ax = b, where A is a large dense matrix, and a double precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16-FP64) iterative refinement, and we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations. These new methods show how using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to 4× speedup. This is due to the performance boost that the FP16-TC provide as well as to the improved accuracy over the classical FP16 arithmetic that is obtained because the GEMM accumulation occurs in FP32 arithmetic.
KW - FP16 Arithmetic
KW - GPU Computing
KW - Half Precision
KW - Iterative Refinement Computation
KW - Linear Algebra
KW - Mixed Precision Solvers
UR - http://www.scopus.com/inward/record.url?scp=85062956114&partnerID=8YFLogxK
U2 - 10.1109/SC.2018.00050
DO - 10.1109/SC.2018.00050
M3 - Conference contribution
AN - SCOPUS:85062956114
T3 - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
SP - 603
EP - 613
BT - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Y2 - 11 November 2018 through 16 November 2018
ER -