Autotuning GEMM kernels for the Fermi GPU

Jakub Kurzak, Stanimire Tomov, Jack Dongarra

Research output: Contribution to journal › Article › peer-review

95 Scopus citations

Abstract

In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being the crucial component of numerical software packages, such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the more important workloads to be implemented on these devices. This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is in the method for generating the search space; specifically, pruning it to a manageable size. Performance numbers match or exceed other available implementations.
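The generate, prune, and benchmark loop described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's generator: the tile sizes, the shared-memory budget, the reuse threshold, and the names `generate_search_space`, `blocked_gemm`, and `autotune` are all assumptions made for this example, and a pure-Python blocked multiply stands in for a generated GPU kernel.

```python
import itertools
import time

# Toy limits assumed for illustration; they are not Fermi's actual limits.
MAX_SHARED_BYTES = 4 * 1024   # hypothetical on-chip buffer budget
ELEM_BYTES = 8                # double precision
MIN_REUSE = 4.0               # hypothetical lower bound on data reuse


def generate_search_space():
    """Enumerate (tile_m, tile_n, tile_k) and prune infeasible points."""
    tiles = (4, 8, 16, 32)
    depths = (4, 8, 16)
    full = list(itertools.product(tiles, tiles, depths))
    pruned = []
    for tile_m, tile_n, tile_k in full:
        # Tiles of A and B must fit in the assumed buffer budget.
        footprint = (tile_m + tile_n) * tile_k * ELEM_BYTES
        if footprint > MAX_SHARED_BYTES:
            continue
        # Multiply-adds per loaded element; low values mean poor reuse.
        if tile_m * tile_n / (tile_m + tile_n) < MIN_REUSE:
            continue
        pruned.append((tile_m, tile_n, tile_k))
    return full, pruned


def blocked_gemm(a, b, n, tile_m, tile_n, tile_k):
    """Compute C = A * B with loop tiling; n must be divisible by each tile."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile_m):
        for j0 in range(0, n, tile_n):
            for k0 in range(0, n, tile_k):
                for i in range(i0, i0 + tile_m):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, k0 + tile_k):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(j0, j0 + tile_n):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune(n=32, repeats=2):
    """Time every surviving variant and return the fastest configuration."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    full, pruned = generate_search_space()
    timings = {}
    for config in pruned:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_gemm(a, b, n, *config)
            best = min(best, time.perf_counter() - start)
        timings[config] = best
    fastest = min(timings, key=timings.get)
    return full, pruned, fastest
```

Calling `autotune()` returns the full space, the pruned space, and the fastest surviving configuration on the machine at hand. The pruning step runs before any timing, so only feasible variants are benchmarked; in this sketch that is the only place where hardware limits enter.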

Original language: English
Article number: 6122021
Pages (from-to): 2045-2057
Number of pages: 13
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 23
Issue number: 11
State: Published - 2012
Externally published: Yes

Funding

This work was supported by DOE grant #DE-SC0003852, “Architecture-aware Algorithms for Scalable Performance and Resilience on Heterogeneous Architectures”; DOE grant #DE-SC0004983, “Matrix Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems”; and Georgia Institute of Technology subcontract #RA241-G1 funded by NSF grant #OCI-0910735, “Keeneland: National Institute for Experimental Computing.” The authors would like to thank David Luebke, Steven Parker, and Massimiliano Fatica for their insightful comments about the Fermi architecture.

Funder: Funder number
National Science Foundation: OCI-0910735
U.S. Department of Energy: DE-SC0003852, DE-SC0004983, RA241-G1

    Keywords

    • BLAS
    • CUDA
    • GEMM
    • Graphics processing unit
    • automatic tuning
    • code generation
    • matrix multiplication
