TY - GEN

T1 - Accelerating GPU kernels for dense linear algebra

AU - Nath, Rajib

AU - Tomov, Stanimire

AU - Dongarra, Jack

PY - 2011

Y1 - 2011

N2 - Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting - a set of GPU-specific optimization techniques - allows us to easily remove performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware configuration and routine parameters, this can lead to two times faster algorithms. Similarly, the matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU-specific acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20× faster than the currently available kernels. We present these kernels and also show their acceleration effect on higher-level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.

AB - Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting - a set of GPU-specific optimization techniques - allows us to easily remove performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware configuration and routine parameters, this can lead to two times faster algorithms. Similarly, the matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU-specific acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20× faster than the currently available kernels. We present these kernels and also show their acceleration effect on higher-level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.

KW - BLAS

KW - GEMM

KW - GPUs

UR - http://www.scopus.com/inward/record.url?scp=79952583455&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-19328-6_10

DO - 10.1007/978-3-642-19328-6_10

M3 - Conference contribution

AN - SCOPUS:79952583455

SN - 9783642193279

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 83

EP - 92

BT - High Performance Computing for Computational Science, VECPAR 2010 - 9th International Conference, Revised Selected Papers

T2 - 9th International Conference on High Performance Computing for Computational Science, VECPAR 2010

Y2 - 22 June 2010 through 25 June 2010

ER -