TY - GEN
T1 - Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication
AU - Haidar, Azzam
AU - Gates, Mark
AU - Tomov, Stan
AU - Dongarra, Jack
PY - 2013
Y1 - 2013
N2 - The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology for addressing these challenges - spanning algorithm design, kernel optimization and tuning, and the programming model - in the development of a scalable, high-performance tridiagonal reduction algorithm for the symmetric eigenvalue problem. This is a fundamental linear algebra problem with many engineering and physics applications. We use a combination of a task-based approach to parallelism and a new algorithmic design to achieve high performance. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. This may increase the number of flops, but the increase is offset by more efficient execution and reduced data transfers. Our performance results are the best available, providing an enormous performance boost over current state-of-the-art solutions. In particular, our software scales up to 1070 Gflop/s using 16 Intel E5-2670 cores and eight M2090 GPUs, compared to the 45 Gflop/s achieved by the optimized Intel Math Kernel Library (MKL) using only the 16 CPU cores.
KW - eigenvalue
KW - gpu communication
KW - gpu computation
KW - heterogeneous programming model
KW - performance
KW - reduction to tridiagonal
KW - singular value decomposition
KW - task parallelism
UR - http://www.scopus.com/inward/record.url?scp=84879806030&partnerID=8YFLogxK
U2 - 10.1145/2464996.2465438
DO - 10.1145/2464996.2465438
M3 - Conference contribution
AN - SCOPUS:84879806030
SN - 9781450321303
T3 - Proceedings of the International Conference on Supercomputing
SP - 223
EP - 232
BT - ICS 2013 - Proceedings of the 2013 ACM International Conference on Supercomputing
T2 - 27th ACM International Conference on Supercomputing, ICS 2013
Y2 - 10 June 2013 through 14 June 2013
ER -