Abstract
Dragonfly-based networks are an extensively deployed topology in large-scale high-performance computing due to their cost-effectiveness and efficiency. The US will soon have three exascale supercomputers for leadership-class workloads deployed using dragonfly networks. Compared to indirect networks of similar scale, the dragonfly network has considerably reduced cable lengths, cable counts, and switch counts, resulting in significant network cost savings for a given system size; however, these cost reductions come at the price of fewer global minimal paths and more challenging routing. Additionally, large-scale dragonfly networks often require a taper at the global link level, resulting in less bisection bandwidth than is achievable in other traditional non-blocking topologies of equivalent scale. While dragonfly networks have been extensively studied, they have yet to be fully evaluated in an extreme-scale (i.e., exascale) system that targets capability workloads. In this paper, we present the results of the first large-scale evaluation of a dragonfly network on an exascale system (Frontier) and compare its behavior to a similar-scale fat-tree network on a previous-generation TOP500 system (Summit). This evaluation aims to determine the effect of network cost optimizations by measuring a tapered topology's impact on capability workloads. Our evaluation is based on a collection of synthetic microbenchmarks, mini-apps, and full-scale applications, and compares the scaling efficiencies of each benchmark between the dragonfly-based Frontier and the fat-tree-based Summit systems. Our results show that a dragonfly network is ∼30% more cost efficient than a fat-tree topology, which amortizes to ∼3% of an exascale system cost.
Furthermore, while tapered dragonfly networks impose significant tradeoffs, the impacts are not as broad as initially thought: they are mostly seen in applications with global communication patterns, particularly all-to-all (e.g., FFT-based algorithms), but also in applications with local communication patterns (e.g., nearest-neighbor algorithms) that are sensitive to network performance variability.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of SC 2024 |
| Subtitle of host publication | International Conference for High Performance Computing, Networking, Storage and Analysis |
| Publisher | IEEE Computer Society |
| ISBN (Electronic) | 9798350352917 |
| DOIs | |
| State | Published - Nov 17 2024 |
| Event | 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024, Atlanta, United States. Duration: Nov 17 2024 → Nov 22 2024 |
Publication series
| Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
|---|---|
| ISSN (Print) | 2167-4329 |
| ISSN (Electronic) | 2167-4337 |
Conference
| Conference | 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024 |
|---|---|
| Country/Territory | United States |
| City | Atlanta |
| Period | 11/17/24 → 11/22/24 |
Funding
This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the DOE under Contract DE-AC05-00OR22725. This work would also not have been possible without the contributions of Stephen Nichols in providing the M-PSDNS configurations and the MPI-related code.
Keywords
- Dragonfly & Fat-tree network topologies
- HPC systems
- network cost optimization
Title
An Evaluation of the Effect of Network Cost Optimization for Leadership Class Supercomputers