Abstract
We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain with existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, by using our framework to batch contractions, combined with application-specific optimizations, we demonstrate close-to-peak performance. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which are the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving many-fold algorithmic acceleration, due to the resulting reuse of data loaded into fast memory. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations that achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60 GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
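The central idea of the abstract, expressing a tensor contraction as index reordering plus a single GEMM, can be illustrated with a minimal NumPy sketch. This is not the paper's GPU implementation; the shapes, names, and the choice of contraction are illustrative assumptions, chosen to mirror the small per-element sizes (e.g., 8) mentioned above.

```python
import numpy as np

n = 8  # small per-element dimension, e.g. basis size in high-order FEM

# Contraction: C[i, j, k] = sum_l A[i, l] * B[l, j, k]
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n, n))

# Direct evaluation of the contraction with einsum (reference result)
C_ref = np.einsum('il,ljk->ijk', A, B)

# Same contraction as a GEMM: flatten the (j, k) indices of B into a
# single column index, do one matrix-matrix multiply, then restore the
# tensor shape. On a GPU, many such small GEMMs would be batched.
C_gemm = (A @ B.reshape(n, n * n)).reshape(n, n, n)

assert np.allclose(C_ref, C_gemm)
```

Casting the contraction as one GEMM is what enables the data reuse in fast memory that the abstract credits for the many-fold acceleration; more general contractions may additionally require a transpose (index reordering) before the multiply.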
| Original language | English |
| --- | --- |
| Pages (from-to) | 108-118 |
| Number of pages | 11 |
| Journal | Procedia Computer Science |
| Volume | 80 |
| DOIs | |
| State | Published - 2016 |
| Externally published | Yes |
| Event | International Conference on Computational Science, ICCS 2016, San Diego, United States. Duration: Jun 6 2016 → Jun 8 2016 |
Funding
This material is based upon work supported by the National Science Foundation under Grants No. CSR 1514286 and ACI-1339822, NVIDIA, the Department of Energy, and in part by the Russian Scientific Foundation, Agreement N14-11-00190. This work was further performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL release number LLNL-CONF-681073-DRAFT.
Keywords
- Applications
- Batched linear algebra
- FEM
- GPU
- Tensor HPC
- Tensor contractions