TY - JOUR
T1 - A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems
AU - Song, Fengguang
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2014 John Wiley & Sons, Ltd.
PY - 2015/9/25
Y1 - 2015/9/25
N2 - Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.
AB - Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.
KW - dense linear algebra
KW - distributed dataflow scheduling
KW - heterogeneous HPC systems
KW - runtime systems
UR - http://www.scopus.com/inward/record.url?scp=84941808957&partnerID=8YFLogxK
U2 - 10.1002/cpe.3403
DO - 10.1002/cpe.3403
M3 - Article
AN - SCOPUS:84941808957
SN - 1532-0626
VL - 27
SP - 3702
EP - 3723
JO - Concurrency and Computation: Practice and Experience
JF - Concurrency and Computation: Practice and Experience
IS - 14
ER -