HAN: A Hierarchical AutotuNed Collective Communication Framework

Xi Luo, Wei Wu, George Bosilca, Yu Pei, Qinglei Cao, Thananon Patinyasakdikul, Dong Zhong, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

17 Scopus citations

Abstract

High-performance computing (HPC) systems keep growing in scale and heterogeneity to satisfy increasing computational needs, and this brings new challenges to the design of MPI libraries, especially with regard to collective operations. To address these challenges, we present HAN, a new hierarchical autotuned collective communication framework in Open MPI, which selects suitable homogeneous collective communication modules as submodules for each hardware level, uses collective operations from the submodules as tasks, and organizes these tasks to perform efficient hierarchical collective operations. With a task-based design, HAN can easily swap out submodules, while keeping tasks intact, to adapt to new hardware. This makes HAN suitable for current platforms and provides strong and flexible support for future HPC systems. To provide a fast and accurate autotuning mechanism, we present a novel cost model based on benchmarking the tasks instead of a whole collective operation. This method drastically reduces tuning time, as the cost of tasks can be reused across different message sizes, and is more accurate than existing cost models. Our cost analysis suggests the autotuning component can find the optimal configuration in most cases. The evaluation of the HAN framework suggests our design significantly improves on the default Open MPI and achieves decent speedups against state-of-the-art MPI implementations on tested applications.
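The idea of tuning from task costs rather than from whole-collective benchmarks can be illustrated with a toy sketch. The abstract does not give the cost formula, so everything below is an assumption made for illustration: the two-stage pipeline formula, the names `TASK_COSTS_US`, `estimate_cost_us`, and `best_segment_size`, and the timing values, which are invented rather than measured. It is not HAN's actual model or Open MPI's API.

```python
import math

# Hypothetical per-task benchmark table (made-up numbers, not measurements):
# segment size in bytes -> (inter-node task time, intra-node task time) in microseconds.
TASK_COSTS_US = {
    16 * 1024: (12.0, 5.0),
    64 * 1024: (30.0, 14.0),
    256 * 1024: (100.0, 52.0),
}


def estimate_cost_us(message_bytes, segment_bytes, task_costs=TASK_COSTS_US):
    """Estimate a two-level pipelined collective from per-segment task costs.

    Assumed model: the first segment passes through both levels, and each
    later segment adds the time of the slower level.
    """
    inter_us, intra_us = task_costs[segment_bytes]
    segments = math.ceil(message_bytes / segment_bytes)
    return inter_us + intra_us + (segments - 1) * max(inter_us, intra_us)


def best_segment_size(message_bytes, task_costs=TASK_COSTS_US):
    """Pick the segment size with the lowest estimated cost for this message."""
    return min(task_costs, key=lambda s: estimate_cost_us(message_bytes, s, task_costs))


if __name__ == "__main__":
    for size in (64 * 1024, 1024 * 1024):
        print(size, best_segment_size(size))
```

Under these assumptions the same small table of task costs is reused for every message size, so tuning a new size costs only a few arithmetic evaluations instead of a new benchmark run of the full collective.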

Original language: English
Title of host publication: Proceedings - 2020 IEEE International Conference on Cluster Computing, CLUSTER 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 23-34
Number of pages: 12
ISBN (Electronic): 9781728166773
DOIs
State: Published - Sep 2020
Externally published: Yes
Event: 22nd IEEE International Conference on Cluster Computing, CLUSTER 2020 - Kobe, Japan
Duration: Sep 14 2020 to Sep 17 2020

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2020-September
ISSN (Print): 1552-5244

Conference

Conference: 22nd IEEE International Conference on Cluster Computing, CLUSTER 2020
Country/Territory: Japan
City: Kobe
Period: 09/14/20 to 09/17/20

Funding

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by the National Science Foundation under award EVOLVE #1664142. Experiments on Shaheen II were supported by the Supercomputing Laboratory at KAUST, and experiments on Stampede2 were supported by the Texas Advanced Computing Center.

Funders (Funder number)
Texas Advanced Computing Center
U.S. Department of Energy Office of Science
National Science Foundation (1664142)
National Nuclear Security Administration
King Abdullah University of Science and Technology

    Keywords

    • MPI
    • autotuning
    • cost model
    • hierarchical collective operation
