Bringing HPE Slingshot 11 support to Open MPI

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

The Cray HPE Slingshot 11 network is used on the new exascale systems arriving at the U.S. Department of Energy (DOE) laboratories (e.g., Frontier, Aurora, Perlmutter). As such, support for this network is an important capability to meet the needs of exascale applications. This article highlights recent work to develop supporting infrastructure to enable Open MPI to efficiently support these new platforms. A key component of this effort involves development of a new Open Fabrics Interface (OFI) provider, LinkX. We discuss the design and development of enhancements that take advantage of the new Slingshot 11 network and AMD GPUs. We include performance data from tests on the Frontier supercomputer using synthetic communication benchmarks, with the vendor-provided MPI as a baseline for comparison. The tests demonstrate full functionality of Open MPI on the system, and initial results show favorable performance when compared to the highly tuned vendor implementation.

Original language: English
Article number: e8203
Journal: Concurrency and Computation: Practice and Experience
Volume: 36
Issue number: 22
State: Published - Oct 10 2024

Funding

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Howard Pritchard acknowledges support by the National Nuclear Security Administration. Los Alamos National Laboratory is operated by Triad National Security, LLC for the U.S. Department of Energy. This research was partially supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration under Contract No. 89233218CNA000001. This manuscript has been authored in part by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders                                     Funder number
United States Government
DOE Public Access Plan
U.S. Department of Energy                   DE-AC05-00OR22725
National Nuclear Security Administration    17-SC-20-SC
Office of Science                           89233218CNA000001

    Keywords

    • Open MPI
    • Slingshot
    • high performance computing
    • libfabric
    • message passing interface
