## Abstract

Machine learning and artificial intelligence (AI) applications often rely on performing many small matrix operations—in particular general matrix–matrix multiplication (GEMM). These operations are usually performed in a reduced precision, such as the 16-bit floating-point format (i.e., half precision or FP16). The GEMM operation is also very important for dense linear algebra algorithms, and half-precision GEMM operations can be used in mixed-precision linear solvers. Therefore, high-performance batched GEMM operations in reduced precision are critically important, not only for deep learning frameworks, but also for scientific applications that rely on batched linear algebra, such as tensor contractions and sparse direct solvers. This paper presents optimized batched GEMM kernels for graphics processing units (GPUs) in FP16 arithmetic. The paper addresses both real and complex half-precision computations on the GPU. The proposed design takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. With eight tuning parameters introduced in the design, the developed kernels have a high degree of flexibility that overcomes the limitations imposed by the hardware and software (in the form of discrete configurations for the Tensor Core APIs). For real FP16 arithmetic, performance speedups are observed against cuBLAS for sizes up to 128, and range between 1.5× and 2.5×. For the complex FP16 GEMM kernel, the speedups are between 1.7× and 7×, thanks to a design that uses the standard interleaved matrix layout, in contrast with the planar layout required by the vendor's solution. The paper also discusses special optimizations for extremely small matrices, where even higher performance gains are achievable.
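The layout distinction the abstract draws can be made concrete with a minimal plain-Python sketch (helper names are illustrative, not from the paper). The standard interleaved layout stores the real and imaginary parts of each element together, as in BLAS/LAPACK, while the planar layout keeps them in two separate real matrices; a complex GEMM on planar operands then reduces to four real GEMMs via Cr = Ar·Br − Ai·Bi and Ci = Ar·Bi + Ai·Br. The actual kernels operate on FP16 tiles through Tensor Cores; full precision is used here only to show the data movement and arithmetic structure.

```python
def interleaved_to_planar(a):
    """Split a standard interleaved complex matrix into (real, imag) planes,
    as a planar-layout API would require before calling the GEMM."""
    real = [[x.real for x in row] for row in a]
    imag = [[x.imag for x in row] for row in a]
    return real, imag

def real_gemm(a, b):
    """Naive real matrix-matrix multiply: C = A * B (reference only)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def planar_cgemm(ar, ai, br, bi):
    """Complex GEMM on planar operands via four real GEMMs:
    Cr = Ar*Br - Ai*Bi,  Ci = Ar*Bi + Ai*Br."""
    t1, t2 = real_gemm(ar, br), real_gemm(ai, bi)
    t3, t4 = real_gemm(ar, bi), real_gemm(ai, br)
    cr = [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(t1, t2)]
    ci = [[x + y for x, y in zip(r3, r4)] for r3, r4 in zip(t3, t4)]
    return cr, ci

# Start from the standard interleaved representation...
A = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]]
B = [[2 - 1j, 1 + 1j], [1 + 0j, 0 - 2j]]
# ...split into planes (the extra step a planar-only API imposes)...
ar, ai = interleaved_to_planar(A)
br, bi = interleaved_to_planar(B)
cr, ci = planar_cgemm(ar, ai, br, bi)
# ...and recombine into interleaved form for the rest of the application.
C = [[complex(r, i) for r, i in zip(rr, ii)] for rr, ii in zip(cr, ci)]
```

A kernel that consumes the interleaved layout directly, as the paper's design does, avoids the split/recombine passes over the data entirely, which is one source of the reported speedup over the planar-layout vendor path.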

Original language | English
---|---
Pages (from-to) | 188-201
Number of pages | 14
Journal | Journal of Parallel and Distributed Computing
Volume | 145
DOIs |
State | Published - Nov 2020

### Bibliographical note

Publisher Copyright: © 2020 Elsevier Inc.

### Funding

This work is partially supported by NSF Grants No. OAC 1740250 and CSR 1514286, NVIDIA, and the Department of Energy under the Exascale Computing Project (17-SC-20-SC and an LLNL subcontract under DOE contract DE-AC52-07NA27344).

Funders | Funder number
---|---
National Science Foundation | CSR 1514286, 1740250, OAC 1740250
U.S. Department of Energy | 17-SC-20-SC, DE-AC52-07NA27344
NVIDIA |

## Keywords

- Batch linear algebra
- GPU computing
- Half precision
- Matrix multiplication