TY - GEN
T1 - A scalable framework for heterogeneous GPU-based clusters
AU - Song, Fengguang
AU - Dongarra, Jack
PY - 2012
Y1 - 2012
N2 - GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance, however, there is little parallel software available that can utilize all CPU cores and all GPUs on the heterogeneous system efficiently. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases in a much faster rate than the performance of the PCI-Express connection (or the interconnection network) such that communication eventually becomes the bottleneck of the entire system. To overcome the bottleneck, we developed a multilevel partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program and generate hybrid-size tasks, and follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system to schedule tasks, and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to solve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [25] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework is able to attain high performance on distributed-memory clusters without GPUs, and shareds-ystem multiGPUs.
AB - GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance, however, there is little parallel software available that can utilize all CPU cores and all GPUs on the heterogeneous system efficiently. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases in a much faster rate than the performance of the PCI-Express connection (or the interconnection network) such that communication eventually becomes the bottleneck of the entire system. To overcome the bottleneck, we developed a multilevel partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program and generate hybrid-size tasks, and follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system to schedule tasks, and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to solve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [25] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework is able to attain high performance on distributed-memory clusters without GPUs, and shareds-ystem multiGPUs.
KW - Distributed runtime
KW - Heterogeneous clusters
KW - Hybrid CPU-GPU architectures
KW - Linear algebra
KW - Manycore scheduling
UR - http://www.scopus.com/inward/record.url?scp=84864149777&partnerID=8YFLogxK
U2 - 10.1145/2312005.2312025
DO - 10.1145/2312005.2312025
M3 - Conference contribution
AN - SCOPUS:84864149777
SN - 9781450312134
T3 - Annual ACM Symposium on Parallelism in Algorithms and Architectures
SP - 91
EP - 100
BT - SPAA'12 - Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures
T2 - 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'12
Y2 - 25 June 2012 through 27 June 2012
ER -