Matrix multiplication on batches of small matrices in half and half-complex precisions

Ahmad Abdelfattah, Stanimire Tomov, Jack Dongarra

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

Machine learning and artificial intelligence (AI) applications often rely on performing many small matrix operations, in particular general matrix–matrix multiplication (GEMM). These operations are usually performed in a reduced precision, such as the 16-bit floating-point format (i.e., half precision or FP16). The GEMM operation is also central to dense linear algebra algorithms, and half-precision GEMM can be used in mixed-precision linear solvers. High-performance batched GEMM operations in reduced precision are therefore important not only for deep learning frameworks, but also for scientific applications that rely on batched linear algebra, such as tensor contractions and sparse direct solvers. This paper presents optimized batched GEMM kernels for graphics processing units (GPUs) in FP16 arithmetic, addressing both real and complex half-precision computations on the GPU. The proposed design takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. With eight tuning parameters introduced in the design, the developed kernels have a high degree of flexibility that overcomes the limitations imposed by the hardware and software (in the form of discrete configurations for the Tensor Core APIs). For real FP16 arithmetic, speedups of 1.5× to 2.5× over cuBLAS are observed for sizes up to 128. For the complex FP16 GEMM kernel, the speedups range from 1.7× to 7×, thanks to a design that uses the standard interleaved matrix layout, in contrast with the planar layout required by the vendor's solution. The paper also discusses special optimizations for extremely small matrices, where even higher performance gains are achievable.
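The paper's kernels themselves are not reproduced here. As an illustration of the operation being optimized, the sketch below expresses a batched FP16 GEMM through the vendor's cuBLAS strided-batched API, which is the baseline the paper compares against; all matrix sizes, batch counts, and variable names are illustrative choices, not values from the paper.

// Illustrative sketch (not the paper's kernels): one call that multiplies a
// batch of small m x k by k x n half-precision matrices on the GPU.
// Build with: nvcc -o batched_hgemm batched_hgemm.cu -lcublas
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int m = 32, n = 32, k = 32;          // small matrices, as targeted by the paper
    const int batch = 1000;                    // number of independent GEMMs
    const long long strideA = (long long)m * k;
    const long long strideB = (long long)k * n;
    const long long strideC = (long long)m * n;

    __half *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(__half) * strideA * batch);
    cudaMalloc(&dB, sizeof(__half) * strideB * batch);
    cudaMalloc(&dC, sizeof(__half) * strideC * batch);
    // ... fill dA and dB with application data (omitted) ...

    cublasHandle_t handle;
    cublasCreate(&handle);

    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    // C_i = alpha * A_i * B_i + beta * C_i for i = 0 .. batch-1 (column-major).
    // CUBLAS_COMPUTE_16F keeps the computation in half precision; cuBLAS may
    // route the call to Tensor Cores when the hardware supports them.
    cublasGemmStridedBatchedEx(handle,
        CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        dA, CUDA_R_16F, m, strideA,
        dB, CUDA_R_16F, k, strideB,
        &beta,
        dC, CUDA_R_16F, m, strideC,
        batch,
        CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Note that this call covers real FP16 only and assumes uniform matrix sizes across the batch. For the half-complex case, the abstract points out that the vendor's solution requires a planar layout (separate real and imaginary arrays), whereas the paper's kernels accept the standard interleaved layout directly, which is where the larger 1.7× to 7× speedups come from.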

Original language: English
Pages (from-to): 188-201
Number of pages: 14
Journal: Journal of Parallel and Distributed Computing
Volume: 145
DOIs
State: Published - Nov 2020

Bibliographical note

Publisher Copyright:
© 2020 Elsevier Inc.

Funding

This work is partially supported by NSF Grants No. OAC 1740250 and CSR 1514286, NVIDIA, and the Department of Energy under the Exascale Computing Project (17-SC-20-SC and an LLNL subcontract under DOE contract DE-AC52-07NA27344).

Funders and funder numbers:
National Science Foundation: CSR 1514286, OAC 1740250
U.S. Department of Energy: 17-SC-20-SC, DE-AC52-07NA27344
NVIDIA

Keywords

• Batch linear algebra
• GPU computing
• Half precision
• Matrix multiplication