Optimized Batched Linear Algebra for Modern Architectures

Jack Dongarra, Sven Hammarling, Nicholas J. Higham, Samuel D. Relton, Mawussi Zounon

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

9 Scopus citations

Abstract

Solving large numbers of small linear algebra problems simultaneously is becoming increasingly important in many application areas. Whilst many researchers have investigated the design of efficient batch linear algebra kernels for GPU architectures, the common approach for many/multi-core CPUs is to use one core per subproblem in the batch. When solving batches of very small matrices, for example, this design exhibits two main issues: it fails to fully utilize the vector units and the cache of modern architectures, since the matrices are too small. Our approach to resolve this is as follows: given a batch of small matrices spread throughout the primary memory, we first reorganize the elements of the matrices into a contiguous array, using a block interleaved memory format, which allows us to process the small independent problems as a single large matrix problem and enables cross-matrix vectorization. The large problem is solved using blocking strategies that attempt to optimize the use of the cache. The solution is then converted back to the original storage format. To explain our approach we focus on two BLAS routines: general matrix-matrix multiplication (GEMM) and the triangular solve (TRSM). We extend this idea to LAPACK routines using the Cholesky factorization and solve (POSV). Our focus is primarily on very small matrices, with sizes up to 32 × 32. Compared to both MKL and OpenMP implementations, our approach can be up to 4 times faster for GEMM, up to 14 times faster for TRSM, and up to 40 times faster for POSV on the new Intel Xeon Phi processor, code-named Knights Landing (KNL). Furthermore, we discuss strategies to avoid data movement between sockets when using our interleaved approach on a NUMA node.
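The three steps in the abstract (pack into a block interleaved array, compute across matrices, convert back) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the row-major indexing within each matrix, the zero padding of the last block, and the choice of `block` as a stand-in for the vector length are all assumptions made for illustration. A real kernel would be written in C or Fortran with SIMD instructions and cache blocking.

```python
def interleave(batch, n, block):
    """Pack n-by-n matrices (lists of rows) into one flat list.

    Assumed layout: matrices are grouped into blocks of `block` matrices.
    Inside a block, entry (i, j) of all `block` matrices is contiguous.
    The last block is zero-padded when len(batch) is not a multiple of `block`.
    """
    nblocks = (len(batch) + block - 1) // block
    flat = [0.0] * (nblocks * n * n * block)
    for b, mat in enumerate(batch):
        blk, lane = divmod(b, block)
        base = blk * n * n * block
        for i in range(n):
            for j in range(n):
                flat[base + (i * n + j) * block + lane] = mat[i][j]
    return flat


def deinterleave(flat, count, n, block):
    """Convert the interleaved array back to a list of `count` matrices."""
    out = []
    for b in range(count):
        blk, lane = divmod(b, block)
        base = blk * n * n * block
        out.append([[flat[base + (i * n + j) * block + lane]
                     for j in range(n)] for i in range(n)])
    return out


def batched_gemm_interleaved(a, b, n, block):
    """Compute C = A * B for every matrix pair, directly on interleaved data."""
    c = [0.0] * len(a)
    stride = n * n * block  # number of stored values per block of matrices
    for base in range(0, len(a), stride):
        for i in range(n):
            for j in range(n):
                c_off = base + (i * n + j) * block
                for k in range(n):
                    a_off = base + (i * n + k) * block
                    b_off = base + (k * n + j) * block
                    # The innermost loop runs across matrices with unit stride;
                    # this is the loop a compiler or SIMD kernel would vectorize.
                    for lane in range(block):
                        c[c_off + lane] += a[a_off + lane] * b[b_off + lane]
    return c


# Hypothetical batch of three 2-by-2 problems, with a block size of 2.
A = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]]]
B = [[[1.0, 0.0], [0.0, 1.0]], [[5.0, 6.0], [7.0, 8.0]], [[1.0, 1.0], [1.0, 1.0]]]
n, block = 2, 2
C = deinterleave(
    batched_gemm_interleaved(interleave(A, n, block), interleave(B, n, block), n, block),
    len(A), n, block,
)
```

The point of the layout is that the innermost loop touches consecutive memory locations belonging to different matrices, so the vector width is filled even when each individual matrix is far too small to fill it on its own.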

Original language: English
Title of host publication: Euro-Par 2017
Subtitle of host publication: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Proceedings
Editors: Francisco F. Rivera, Tomas F. Pena, Jose C. Cabaleiro
Publisher: Springer Verlag
Pages: 511-522
Number of pages: 12
ISBN (Print): 9783319642024
DOIs
State: Published - 2017
Event: 23rd International Conference on Parallel and Distributed Computing, Euro-Par 2017 - Santiago de Compostela, Spain
Duration: Aug 28 2017 → Sep 1 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10417 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 23rd International Conference on Parallel and Distributed Computing, Euro-Par 2017
Country/Territory: Spain
City: Santiago de Compostela
Period: 08/28/17 → 09/1/17

Funding

Acknowledgements. The authors would like to thank The University of Tennessee for the use of their computational resources. This research was funded in part from the European Union’s Horizon 2020 research and innovation programme under the NLAFET grant agreement No. 671633.
