Abstract
General dense matrix-matrix multiplication (GEMM) is fundamental to obtaining high performance in many scientific computing applications. GEMMs for small matrices (of sizes less than 32), however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures, a case that often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases that obtain performance within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries.
Original language | English |
---|---|
Title of host publication | Parallel Processing - 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016, Proceedings |
Editors | Pierre-François Dutot, Denis Trystram |
Publisher | Springer Verlag |
Pages | 659-671 |
Number of pages | 13 |
ISBN (Print) | 9783319436586 |
DOIs | |
State | Published - 2016 |
Externally published | Yes |
Event | 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016 - Grenoble, France; Duration: Aug 24 2016 → Aug 26 2016 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 9833 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016 |
---|---|
Country/Territory | France |
City | Grenoble |
Period | 08/24/16 → 08/26/16 |
Funding
This material is based in part upon work supported by the US NSF under Grants No. CSR 1514286 and ACI-1339822, NVIDIA, the Department of Energy, and in part by the Russian Scientific Foundation, Agreement N14-11-00190.
Keywords
- Autotuning
- Batched GEMM
- GEMM
- HPC
- Small matrices