ADAPT: An event-based adaptive collective communication framework

Xi Luo, Thananon Patinyasakdikul, Wei Wu, Linnan Wang, George Bosilca, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

18 Scopus citations

Abstract

The increase in scale and heterogeneity of high-performance computing (HPC) systems makes the performance of Message Passing Interface (MPI) collective communications susceptible to noise and requires them to adapt to a complex mix of hardware capabilities. The designs of state-of-the-art MPI collectives rely heavily on synchronizations; these designs magnify noise across the participating processes, resulting in significant performance slowdown. Therefore, such a design philosophy must be reconsidered to run efficiently and robustly on large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI that uses event-driven techniques to morph collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations while maintaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of data movement lanes in heterogeneous systems, we extend the ADAPT collective framework with a topology-aware communication tree. This removes the boundaries of different hardware topologies while maximizing the speed of data movements. We evaluate our framework with two popular collective operations, broadcast and reduce, on both CPU and GPU clusters. Our results demonstrate drastic performance improvements and a strong resistance against noise compared to other state-of-the-art MPI libraries. In particular, we demonstrate at least 1.3× and 1.5× speedup for CPU data and 2× and 10× speedup for GPU data using ADAPT event-based broadcast and reduce operations, respectively.
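The idea of relaxing synchronizations while keeping only the data dependencies can be illustrated with a toy timing model. The sketch below is not ADAPT's implementation and makes no MPI calls; the binary tree layout, the per-segment cost `t`, the `noise` table, and both function names are assumptions made purely for illustration. It compares a segmented tree broadcast in which every round ends with a global synchronization against one in which each rank proceeds as soon as its parent holds the segment.

```python
def depth(rank):
    """Depth of `rank` in a binary tree rooted at rank 0 (parent = (rank - 1) // 2)."""
    d = 0
    while rank > 0:
        rank = (rank - 1) // 2
        d += 1
    return d


def event_driven_time(n, segments, t, noise):
    """Completion time when a rank waits only for its parent and its own previous segment."""
    # ready[r][s] is the time at which rank r holds segment s; the root holds all data at time 0.
    ready = [[0.0] * segments for _ in range(n)]
    for r in range(1, n):
        parent = (r - 1) // 2  # parent < r, so its row is already filled in
        for s in range(segments):
            prev = ready[r][s - 1] if s > 0 else 0.0
            ready[r][s] = max(ready[parent][s], prev) + t + noise[r][s]
    return max(row[segments - 1] for row in ready)


def synchronized_time(n, segments, t, noise):
    """Completion time when every round ends with a synchronization across all ranks."""
    max_depth = max(depth(r) for r in range(n))
    total = 0.0
    # In round k, a rank at depth d receives segment k - d (if that segment exists).
    for k in range(1, max_depth + segments):
        active = [noise[r][k - depth(r)]
                  for r in range(1, n)
                  if 0 <= k - depth(r) < segments]
        if active:
            # The whole round lasts as long as its slowest participant.
            total += t + max(active)
    return total


if __name__ == "__main__":
    n, segments, t = 7, 4, 1.0
    noise = [[0.0] * segments for _ in range(n)]
    noise[3][0] = 5.0  # hypothetical delay on rank 3 while handling segment 0
    noise[4][1] = 5.0  # hypothetical delay on rank 4 while handling segment 1
    print("event-driven:", event_driven_time(n, segments, t, noise))
    print("synchronized:", synchronized_time(n, segments, t, noise))
```

In this model the two delays hit different ranks in different rounds, so the synchronized variant pays for both in sequence, while the event-driven variant lets the unaffected ranks continue and absorbs the delays in parallel. Real collectives involve link contention, buffering, and progress-engine effects that this sketch deliberately ignores.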

Original language: English
Title of host publication: HPDC 2018 - Proceedings of the 2018 International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery, Inc
Pages: 118-130
Number of pages: 13
ISBN (Electronic): 9781450357852
State: Published - Jun 11 2018
Event: 27th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2018 - Tempe, United States
Duration: Jun 11 2018 to Jun 15 2018

Publication series

Name: HPDC 2018 - Proceedings of the 2018 International Symposium on High-Performance Parallel and Distributed Computing

Conference

Conference: 27th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2018
Country/Territory: United States
City: Tempe
Period: 06/11/18 to 06/15/18

Keywords

  • Collective operations
  • Event-driven
  • GPU
  • Heterogeneous system
  • MPI
  • System noise
