Abstract
The Cray HPE Slingshot 11 network is used on the new exascale systems arriving at the U.S. Department of Energy (DOE) laboratories (e.g., Frontier, Aurora, Perlmutter). As such, support for this network is an important capability for meeting the needs of exascale applications. This article highlights recent work to develop supporting infrastructure that enables Open MPI to efficiently support these new platforms. A key component of this effort is the development of a new Open Fabrics Interface (OFI) provider, LinkX. We discuss the design and development of enhancements that take advantage of the new Slingshot 11 network and AMD GPUs. We include performance data from tests on the Frontier supercomputer using synthetic communication benchmarks, with the vendor-provided MPI as a baseline for comparison. The tests demonstrate full functionality of Open MPI on the system, and initial results show favorable performance when compared to the highly tuned vendor implementation.
Original language | English |
---|---|
Article number | e8203 |
Journal | Concurrency and Computation: Practice and Experience |
Volume | 36 |
Issue number | 22 |
DOIs | |
State | Published - Oct 10 2024 |
Funding
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Howard Pritchard acknowledges support by the National Nuclear Security Administration. Los Alamos National Laboratory is operated by Triad National Security, LLC for the U.S. Department of Energy. This research was partially supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration under Contract No. 89233218CNA000001. This manuscript has been authored in part by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Funders | Funder number |
---|---|
United States Government | |
DOE Public Access Plan | |
U.S. Department of Energy | DE‐AC05‐00OR22725 |
National Nuclear Security Administration | 17‐SC‐20‐SC |
Office of Science | 89233218CNA000001 |
Keywords
- Open MPI
- Slingshot
- high performance computing
- libfabric
- message passing interface