The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems

Jack Dongarra, Sven Hammarling, Nicholas J. Higham, Samuel D. Relton, Pedro Valero-Lara, Mawussi Zounon

Research output: Contribution to journal › Conference article › peer-review

54 Scopus citations

Abstract

A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on general matrix-matrix multiplication (GEMM), to explore their effect on performance. In particular, we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications, our code outperforms both MKL and cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of double-precision GEMM operations on matrices of size 2 × 2 to 20 × 20.
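To make the interleaved layout concrete, the following C sketch shows one way a batched GEMM kernel for tiny matrices might be organized around it; this is an illustration of the idea only, not the interface proposed in the paper or implemented by the authors. The function name dgemm_batch_interleaved, the restriction to square n × n matrices, the omission of alpha/beta scaling and transpose options, and the assumption that C is zero-initialized by the caller are all assumptions made here for brevity.

```c
/*
 * Sketch of a batched GEMM (C_k += A_k * B_k for k = 0..batch-1) using an
 * interleaved memory layout: element (i,j) of matrix k is stored at offset
 * (j*n + i)*batch + k, so the same element of consecutive matrices is
 * contiguous in memory and the innermost loop over the batch index can be
 * vectorized by the compiler.
 *
 * Hypothetical illustration only: assumes square n x n matrices, no
 * alpha/beta or transpose options, and C zero-initialized by the caller.
 */
#include <stddef.h>

static void dgemm_batch_interleaved(size_t n, size_t batch,
                                    const double *A, const double *B,
                                    double *C)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            for (size_t p = 0; p < n; ++p) {
                const double *a = &A[(p * n + i) * batch]; /* A_k(i,p) */
                const double *b = &B[(j * n + p) * batch]; /* B_k(p,j) */
                double       *c = &C[(j * n + i) * batch]; /* C_k(i,j) */
                /* One multiply-add per matrix in the batch; this loop
                 * walks contiguous memory, which suits SIMD units. */
                for (size_t k = 0; k < batch; ++k)
                    c[k] += a[k] * b[k];
            }
}
```

In practice one would compile with vectorization enabled (e.g. -O3) and distribute the outer loops, or blocks of the batch, across threads; the implementations evaluated in the paper are considerably more elaborate than this sketch.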

Original language: English
Pages (from-to): 495-504
Number of pages: 10
Journal: Procedia Computer Science
Volume: 108
State: Published - 2017
Event: International Conference on Computational Science, ICCS 2017 - Zurich, Switzerland
Duration: Jun 12, 2017 - Jun 14, 2017

Funding

The authors would like to thank the University of Tennessee for the use of its computational resources. This research was funded in part by the European Union's Horizon 2020 research and innovation programme under the NLAFET grant agreement No. 671633.

Funders: University of Tennessee; Horizon 2020 Framework Programme (grant No. 671633)

Keywords

• BLAS
• Batched BLAS
• High-performance computing
• Memory management
• Parallel processing
• Scientific computing
