TY - GEN
T1 - ADAPT: An event-based adaptive collective communication framework
T2 - 27th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2018
AU - Luo, Xi
AU - Patinyasakdikul, Thananon
AU - Wu, Wei
AU - Wang, Linnan
AU - Bosilca, George
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/6/11
Y1 - 2018/6/11
N2 - The growing scale and heterogeneity of high-performance computing (HPC) systems make the performance of Message Passing Interface (MPI) collective communications increasingly susceptible to noise and require them to adapt to a complex mix of hardware capabilities. State-of-the-art MPI collectives rely heavily on synchronizations, which propagate noise across the participating processes and cause significant performance slowdowns. This design philosophy must therefore be reconsidered for collectives to run efficiently and robustly on large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI that uses event-driven techniques to adapt collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations while maintaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of the data-movement lanes in heterogeneous systems, we extend the ADAPT framework with a topology-aware communication tree, which removes the boundaries between different hardware topologies while maximizing the speed of data movement. We evaluate our framework on two popular collective operations, broadcast and reduce, on both CPU and GPU clusters. Our results demonstrate substantial performance improvements and strong resistance to noise compared with other state-of-the-art MPI libraries: the event-based ADAPT broadcast and reduce operations achieve at least 1.3× and 1.5× speedups, respectively, for CPU data, and 2× and 10× speedups for GPU data.
KW - Collective operations
KW - Event-driven
KW - GPU
KW - Heterogeneous system
KW - MPI
KW - System noise
UR - http://www.scopus.com/inward/record.url?scp=85050079178&partnerID=8YFLogxK
U2 - 10.1145/3208040.3208054
DO - 10.1145/3208040.3208054
M3 - Conference contribution
AN - SCOPUS:85050079178
T3 - HPDC 2018 - Proceedings of the 2018 International Symposium on High-Performance Parallel and Distributed Computing
SP - 118
EP - 130
BT - HPDC 2018 - Proceedings of the 2018 International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
Y2 - 11 June 2018 through 15 June 2018
ER -