Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices

I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, J. Dongarra

Research output: Contribution to journal › Article › peer-review


Abstract

Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.
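For reference, the sketch below illustrates the semantics of the batched operation the abstract describes: a single routine that applies C_i = alpha * A_i * B_i + beta * C_i to a whole batch of independent small matrices. This is a minimal, unoptimized reference in C, not the paper's implementation; the function name dgemm_batched_ref and the batch setup are illustrative only, and in practice this loop is what architecture-specialized kernels (or vendor batched-GEMM interfaces such as cuBLAS's cublasDgemmBatched or MKL's cblas_dgemm_batch) replace.

/* Reference semantics of a batched GEMM over many small matrices
 * (illustrative sketch; not the paper's optimized implementation). */
#include <stdio.h>
#include <stdlib.h>

/* Column-major C_i = alpha*A_i*B_i + beta*C_i for each matrix i in the batch. */
static void dgemm_batched_ref(int m, int n, int k, double alpha,
                              const double *const *A, int lda,
                              const double *const *B, int ldb,
                              double beta, double *const *C, int ldc,
                              int batch)
{
    for (int i = 0; i < batch; ++i) {          /* one small GEMM per batch entry */
        for (int col = 0; col < n; ++col) {
            for (int row = 0; row < m; ++row) {
                double acc = 0.0;
                for (int p = 0; p < k; ++p)
                    acc += A[i][row + p * lda] * B[i][p + col * ldb];
                C[i][row + col * ldc] = alpha * acc + beta * C[i][row + col * ldc];
            }
        }
    }
}

int main(void)
{
    const int n = 32, batch = 4;               /* size-32 matrices, as in the abstract */
    double *A[4], *B[4], *C[4];
    for (int i = 0; i < batch; ++i) {
        A[i] = malloc(n * n * sizeof(double));
        B[i] = malloc(n * n * sizeof(double));
        C[i] = calloc((size_t)n * n, sizeof(double));
        for (int j = 0; j < n * n; ++j) { A[i][j] = 1.0; B[i][j] = 2.0; }
    }
    dgemm_batched_ref(n, n, n, 1.0, (const double *const *)A, n,
                      (const double *const *)B, n, 0.0, C, n, batch);
    printf("C[0][0] = %f (expected %f)\n", C[0][0], 2.0 * n);
    for (int i = 0; i < batch; ++i) { free(A[i]); free(B[i]); free(C[i]); }
    return 0;
}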

Original language: English
Pages (from-to): 1-21
Number of pages: 21
Journal: Parallel Computing
Volume: 81
DOIs
State: Published - Jan 2019
Externally published: Yes

Funding

This material is based in part upon work supported by the US NSF under Grant no. OAC-1740250, NVIDIA, and under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Funders and funder numbers:

• National Science Foundation: OAC-1740250
• U.S. Department of Energy
• Lawrence Livermore National Laboratory: DE-AC52-07NA27344
• NVIDIA

Keywords

• Autotuning
• Batched GEMM
• HPC
• Matrix-matrix product
• Optimization
• Small matrices
