TY - GEN
T1 - Energy efficiency and performance frontiers for sparse computations on GPU supercomputers
AU - Anzt, Hartwig
AU - Tomov, Stanimire
AU - Dongarra, Jack
N1 - Publisher Copyright:
Copyright © 2015 ACM.
PY - 2015/2/7
Y1 - 2015/2/7
N2 - In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigensolver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a sparse matrix-matrix product (SpMM) that achieves significantly higher performance than a sequence of SpMVs. We provide details about the GPU kernels we use for the SpMV, the SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6× performance improvement over the GPU's SpMV, and that the GPU-accelerated LOBPCG based on this kernel is 3 to 5× faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice, though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers.
KW - Blocked sparse matrix vector product
KW - Energy efficiency
KW - GPU supercomputer
KW - LOBPCG
KW - Sparse eigensolver
UR - http://www.scopus.com/inward/record.url?scp=84938104807&partnerID=8YFLogxK
U2 - 10.1145/2712386.2712387
DO - 10.1145/2712386.2712387
M3 - Conference contribution
AN - SCOPUS:84938104807
T3 - Proceedings of the 6th International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2015
SP - 1
EP - 10
BT - Proceedings of the 6th International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2015
A2 - Balaji, Pavan
A2 - Guo, Minyi
A2 - Huang, Zhiyi
PB - Association for Computing Machinery
T2 - 6th International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2015
Y2 - 7 February 2015 through 8 February 2015
ER -