An Evaluation of the Effect of Network Cost Optimization for Leadership Class Supercomputers

Awais Khan, John R. Lange, Nick Hagerty, Edwin F. Posada, John Holmen, James B. White, Austin Harris, Veronica Melesse Vergara, Christopher Zimmer, Scott Atchley

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Dragonfly-based networks are an extensively deployed network topology in large-scale high-performance computing due to their cost-effectiveness and efficiency. The US will soon have three Exascale supercomputers for leadership class workloads deployed using dragonfly networks. Compared to indirect networks of similar scale, the dragonfly network has considerably reduced cable lengths, cable counts, and switch counts, resulting in significant network cost savings for a given system size, however, these cost reductions result in reduced global minimal paths and more challenging routing. Additionally, large scale dragonfly networks often require a taper at the global link level, resulting in less bisection bandwidth than is achievable in other traditional non-blocking topologies of equivalent scale. While dragonfly networks have been extensively studied, they have yet to be fully evaluated in an extreme scale (i.e., exascale) system that targets capability workloads. In this paper, we present the results of the first large scale evaluation of a dragonfly network on an exascale system (Frontier) and compare its behavior to a similar scale fat-tree network on a previous generation TOP500 system (Summit). This evaluation aims to determine the effect of network cost optimizations by measuring a tapered topology's impact on capability workloads. Our evaluation is based on a collection of synthetic microbenchmarks, mini-apps, and full scale applications. It compares the scaling efficiencies of each benchmark between the dragonfly-based Frontier and the fat-tree-based Summit systems. Our results show that a dragonfly network is ∼ 3 0 % more cost efficient than a fat-tree topology, which amortizes to ∼ 3 % of an exascale system cost. Furthermore, while tapered dragonfly networks impose significant tradeoffs, the impacts are not as broad as initially thought and are mostly seen in applications with global communication patterns, particularly all-to-all (e.g., FFT-based algorithms), but also local communication patterns (e.g., nearest-neighbor algorithms) that are sensitive to network performance variability.

Original languageEnglish
Title of host publicationProceedings of SC 2024
Subtitle of host publicationInternational Conference for High Performance Computing, Networking, Storage and Analysis
PublisherIEEE Computer Society
ISBN (Electronic)9798350352917
DOIs
StatePublished - 2024
Event2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024 - Atlanta, United States
Duration: Nov 17 2024Nov 22 2024

Publication series

NameInternational Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print)2167-4329
ISSN (Electronic)2167-4337

Conference

Conference2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024
Country/TerritoryUnited States
CityAtlanta
Period11/17/2411/22/24

Keywords

  • Dragonfly & Fat-tree network topologies
  • HPC systems
  • network cost optimization

Fingerprint

Dive into the research topics of 'An Evaluation of the Effect of Network Cost Optimization for Leadership Class Supercomputers'. Together they form a unique fingerprint.

Cite this