Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs

Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

16 Scopus citations

Abstract

This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations, and autotuning to exceed in performance proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is significantly important for wide range of applications, including astrophysics, tensor contractions, sparse direct solvers, and others. We provide a generic design that is capable of dealing with problems of different sizes, and handling the irregularity arising from size variations. The developed solution adopts a batched computation scheme, where the same operation is concurrently applied to all matrices within a single computational kernel. The paper discusses the data layout, kernel design, and optimization techniques. We also propose a design scheme that is centralized around matrix-matrix multiplication (GEMM) kernel, so that any improvement on this particular kernel propagates automatically to other routines. Our performance results show significant speedups using a Pascal generation GPU (Tesla P100) against state-of-the-art solutions using cuBLAS, as well as against two 10-core Haswell CPUs running the MKL library. This work is part of the MAGMA library.

Original languageEnglish
Title of host publicationICS 2017
Subtitle of host publicationInternational Conference on Supercomputing
PublisherAssociation for Computing Machinery
ISBN (Electronic)9781450350204
DOIs
StatePublished - Jun 14 2017
Externally publishedYes
Event31st ACM International Conference on Supercomputing, ICS 2017 - Chicago, United States
Duration: Jun 13 2017Jun 16 2017

Publication series

NameProceedings of the International Conference on Supercomputing
VolumePart F128411

Conference

Conference31st ACM International Conference on Supercomputing, ICS 2017
Country/TerritoryUnited States
CityChicago
Period06/13/1706/16/17

Funding

FundersFunder number
National Science Foundation1514406

    Keywords

    • Basic Linear ALgebra Subprograms
    • Batched Computation
    • GPU Computing

    Fingerprint

    Dive into the research topics of 'Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs'. Together they form a unique fingerprint.

    Cite this