TY - GEN
T1 - Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs
AU - Abdelfattah, Ahmad
AU - Haidar, Azzam
AU - Tomov, Stanimire
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/6/14
Y1 - 2017/6/14
N2 - This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations, and autotuning to exceed the performance of proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is important for a wide range of applications, including astrophysics, tensor contractions, sparse direct solvers, and others. We provide a generic design that is capable of dealing with problems of different sizes, and of handling the irregularity arising from size variations. The developed solution adopts a batched computation scheme, where the same operation is concurrently applied to all matrices within a single computational kernel. The paper discusses the data layout, kernel design, and optimization techniques. We also propose a design scheme that is centered around the matrix-matrix multiplication (GEMM) kernel, so that any improvement to this particular kernel propagates automatically to the other routines. Our performance results show significant speedups using a Pascal-generation GPU (Tesla P100) against state-of-the-art solutions using cuBLAS, as well as against two 10-core Haswell CPUs running the MKL library. This work is part of the MAGMA library.
AB - This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations, and autotuning to exceed the performance of proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is important for a wide range of applications, including astrophysics, tensor contractions, sparse direct solvers, and others. We provide a generic design that is capable of dealing with problems of different sizes, and of handling the irregularity arising from size variations. The developed solution adopts a batched computation scheme, where the same operation is concurrently applied to all matrices within a single computational kernel. The paper discusses the data layout, kernel design, and optimization techniques. We also propose a design scheme that is centered around the matrix-matrix multiplication (GEMM) kernel, so that any improvement to this particular kernel propagates automatically to the other routines. Our performance results show significant speedups using a Pascal-generation GPU (Tesla P100) against state-of-the-art solutions using cuBLAS, as well as against two 10-core Haswell CPUs running the MKL library. This work is part of the MAGMA library.
KW - Basic Linear Algebra Subprograms
KW - Batched Computation
KW - GPU Computing
UR - http://www.scopus.com/inward/record.url?scp=85023774068&partnerID=8YFLogxK
U2 - 10.1145/3079079.3079103
DO - 10.1145/3079079.3079103
M3 - Conference contribution
AN - SCOPUS:85023774068
T3 - Proceedings of the International Conference on Supercomputing
BT - ICS 2017 - Proceedings of the International Conference on Supercomputing
PB - Association for Computing Machinery
T2 - 31st ACM International Conference on Supercomputing, ICS 2017
Y2 - 13 June 2017 through 16 June 2017
ER -