A scalable framework for heterogeneous GPU-based clusters

Fengguang Song, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

38 Scopus citations

Abstract

GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance; however, there is little parallel software available that can utilize all CPU cores and all GPUs on a heterogeneous system efficiently. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so that communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multilevel partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system to schedule tasks and to transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to resolve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we attain 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [25] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework attains high performance on distributed-memory clusters without GPUs and on shared-system multi-GPUs.
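
The abstract's dataflow idea can be illustrated with a minimal sketch: a serial pass generates tasks with read/write dependencies on tiles, and worker threads "fire" each task as soon as all of its inputs are produced. This is not the authors' runtime system (which spans MPI ranks, GPU streams, and hybrid-size tiles); the names Task, run_dataflow, and the single-node threading model are illustrative assumptions only.

```python
# Minimal single-node sketch (assumption, not the paper's implementation) of a
# dataflow task model: tasks fire once all tiles they read have been produced.
import threading
import queue
from collections import defaultdict

class Task:
    def __init__(self, name, reads, writes, fn):
        self.name = name        # e.g. a tile kernel such as "POTRF(0,0)"
        self.reads = reads      # tiles this task consumes
        self.writes = writes    # tiles this task produces
        self.fn = fn            # the actual CPU or GPU kernel wrapper

def run_dataflow(tasks, num_workers=4):
    # Map each tile to the task that produces it, then count unresolved inputs.
    producers = {tile: t for t in tasks for tile in t.writes}
    deps = {t: sum(1 for tile in t.reads if tile in producers) for t in tasks}
    consumers = defaultdict(list)
    for t in tasks:
        for tile in t.reads:
            if tile in producers:
                consumers[producers[tile]].append(t)

    ready = queue.Queue()
    for t in tasks:
        if deps[t] == 0:
            ready.put(t)        # tasks with no pending inputs fire immediately

    lock = threading.Lock()
    done = [0]

    def worker():
        while True:
            t = ready.get()
            if t is None:
                return
            t.fn()                           # execute the kernel
            with lock:
                done[0] += 1
                for c in consumers[t]:       # resolve consumers' dependencies
                    deps[c] -= 1
                    if deps[c] == 0:
                        ready.put(c)
                if done[0] == len(tasks):    # all work finished: stop workers
                    for _ in range(num_workers):
                        ready.put(None)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads: th.start()
    for th in threads: th.join()

# Tiny usage example: a three-task chain A -> B -> C over two tiles.
if __name__ == "__main__":
    a = Task("A", reads=[],     writes=["T0"], fn=lambda: print("A fired"))
    b = Task("B", reads=["T0"], writes=["T1"], fn=lambda: print("B fired"))
    c = Task("C", reads=["T1"], writes=["T2"], fn=lambda: print("C fired"))
    run_dataflow([a, b, c])
```

In the paper's setting the same firing rule is applied per node, with separate CPU and GPU compute threads and dedicated MPI/CUDA communication threads overlapping data movement with computation.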

Original language: English
Title of host publication: SPAA'12 - Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures
Pages: 91-100
Number of pages: 10
DOIs
State: Published - 2012
Event: 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'12 - Pittsburgh, PA, United States
Duration: Jun 25, 2012 - Jun 27, 2012

Publication series

Name: Annual ACM Symposium on Parallelism in Algorithms and Architectures

Conference

Conference: 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'12
Country/Territory: United States
City: Pittsburgh, PA
Period: 06/25/12 - 06/27/12

Keywords

  • Distributed runtime
  • Heterogeneous clusters
  • Hybrid CPU-GPU architectures
  • Linear algebra
  • Manycore scheduling
