Performance, design, and autotuning of batched GEMM for GPUs

Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, Jack Dongarra

Research output: Conference contribution (Chapter in Book/Report/Conference proceeding), peer-reviewed

83 Scopus citations

Abstract

The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high-performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in higher-level LAPACK routines and in scientific computing applications in general. This paper presents a high-performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most of the performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches on a K40c GPU.
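For readers unfamiliar with the batched interface the abstract refers to, the following is a minimal sketch (not the paper's kernel) of a fixed-size batched DGEMM, C_i = alpha*A_i*B_i + beta*C_i for i = 0..batch-1, using the standard cuBLAS routine cublasDgemmBatched, one of the state-of-the-art baselines such work is typically compared against. The sizes n = 16 and batch = 10000 are illustrative assumptions; error checking and matrix initialization are omitted.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const int n = 16;          /* each matrix is 16 x 16: the "small size" regime */
    const int batch = 10000;   /* many independent GEMMs launched at once */
    const double alpha = 1.0, beta = 0.0;

    /* One contiguous slab per operand; matrix i starts at offset i*n*n. */
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, (size_t)n * n * batch * sizeof(double));
    cudaMalloc((void **)&dB, (size_t)n * n * batch * sizeof(double));
    cudaMalloc((void **)&dC, (size_t)n * n * batch * sizeof(double));
    /* ... fill dA and dB with input matrices (e.g., cudaMemcpy from host) ... */

    /* The batched API takes device-resident arrays of per-matrix pointers. */
    double **hA = (double **)malloc(batch * sizeof(double *));
    double **hB = (double **)malloc(batch * sizeof(double *));
    double **hC = (double **)malloc(batch * sizeof(double *));
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + (size_t)i * n * n;
        hB[i] = dB + (size_t)i * n * n;
        hC[i] = dC + (size_t)i * n * n;
    }
    double **dAarr, **dBarr, **dCarr;
    cudaMalloc((void **)&dAarr, batch * sizeof(double *));
    cudaMalloc((void **)&dBarr, batch * sizeof(double *));
    cudaMalloc((void **)&dCarr, batch * sizeof(double *));
    cudaMemcpy(dAarr, hA, batch * sizeof(double *), cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB, batch * sizeof(double *), cudaMemcpyHostToDevice);
    cudaMemcpy(dCarr, hC, batch * sizeof(double *), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* One call performs all `batch` multiplications; square, column-major
     * matrices here, so m = n = k and all leading dimensions equal n. */
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, (const double **)dAarr, n,
                               (const double **)dBarr, n,
                       &beta,  dCarr, n, batch);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dAarr); cudaFree(dBarr); cudaFree(dCarr);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Variable-size batches, the second case the paper addresses, generalize this interface by replacing the scalar dimensions and leading dimensions with per-problem arrays (as in MAGMA's vbatched routines).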

Original language: English
Title of host publication: High Performance Computing - 31st International Conference, ISC High Performance 2016, Proceedings
Editors: Jack Dongarra, Julian M. Kunkel, Pavan Balaji
Publisher: Springer Verlag
Pages: 21-38
Number of pages: 18
ISBN (Print): 9783319413204
DOIs
State: Published - 2016
Event: 31st International Conference on High Performance Computing, ISC High Performance 2016 - Frankfurt, Germany
Duration: Jun 19 2016 - Jun 23 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9697
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 31st International Conference on High Performance Computing, ISC High Performance 2016
Country/Territory: Germany
City: Frankfurt
Period: 06/19/16 - 06/23/16

Funding

This work is based upon work supported by the National Science Foundation under Grants No. ACI-1339822 and CSR 1514286, NVIDIA, the Department of Energy (LLNL subcontract under DOE contract DE-AC52-07NA27344), and in part by the Russian Science Foundation, Agreement N14-11-00190.

Funders and funder numbers:

National Science Foundation: CSR 1514286, ACI-1339822
U.S. Department of Energy: DE-AC52-07NA27344
NVIDIA
Russian Science Foundation: N14-11-00190

Keywords

• Autotuning
• Batched GEMM
• GEMM
• GPU computing
• HPC
