TY - GEN
T1 - An Evaluation of the Effect of Network Cost Optimization for Leadership Class Supercomputers
AU - Khan, Awais
AU - Lange, John R.
AU - Hagerty, Nick
AU - Posada, Edwin F.
AU - Holmen, John
AU - White, James B.
AU - Harris, Austin
AU - Vergara, Veronica Melesse
AU - Zimmer, Christopher
AU - Atchley, Scott
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Dragonfly-based networks are an extensively deployed network topology in large-scale high-performance computing due to their cost-effectiveness and efficiency. The US will soon have three exascale supercomputers for leadership class workloads deployed using dragonfly networks. Compared to indirect networks of similar scale, the dragonfly network has considerably reduced cable lengths, cable counts, and switch counts, resulting in significant network cost savings for a given system size; however, these cost reductions result in reduced global minimal paths and more challenging routing. Additionally, large-scale dragonfly networks often require a taper at the global link level, resulting in less bisection bandwidth than is achievable in other traditional non-blocking topologies of equivalent scale. While dragonfly networks have been extensively studied, they have yet to be fully evaluated in an extreme-scale (i.e., exascale) system that targets capability workloads. In this paper, we present the results of the first large-scale evaluation of a dragonfly network on an exascale system (Frontier) and compare its behavior to a similar-scale fat-tree network on a previous-generation TOP500 system (Summit). This evaluation aims to determine the effect of network cost optimizations by measuring a tapered topology's impact on capability workloads. Our evaluation is based on a collection of synthetic microbenchmarks, mini-apps, and full-scale applications. It compares the scaling efficiencies of each benchmark between the dragonfly-based Frontier and the fat-tree-based Summit systems. Our results show that a dragonfly network is ∼30% more cost efficient than a fat-tree topology, which amortizes to ∼3% of an exascale system cost. Furthermore, while tapered dragonfly networks impose significant tradeoffs, the impacts are not as broad as initially thought and are mostly seen in applications with global communication patterns, particularly all-to-all (e.g., FFT-based algorithms), but also local communication patterns (e.g., nearest-neighbor algorithms) that are sensitive to network performance variability.
AB - Dragonfly-based networks are an extensively deployed network topology in large-scale high-performance computing due to their cost-effectiveness and efficiency. The US will soon have three exascale supercomputers for leadership class workloads deployed using dragonfly networks. Compared to indirect networks of similar scale, the dragonfly network has considerably reduced cable lengths, cable counts, and switch counts, resulting in significant network cost savings for a given system size; however, these cost reductions result in reduced global minimal paths and more challenging routing. Additionally, large-scale dragonfly networks often require a taper at the global link level, resulting in less bisection bandwidth than is achievable in other traditional non-blocking topologies of equivalent scale. While dragonfly networks have been extensively studied, they have yet to be fully evaluated in an extreme-scale (i.e., exascale) system that targets capability workloads. In this paper, we present the results of the first large-scale evaluation of a dragonfly network on an exascale system (Frontier) and compare its behavior to a similar-scale fat-tree network on a previous-generation TOP500 system (Summit). This evaluation aims to determine the effect of network cost optimizations by measuring a tapered topology's impact on capability workloads. Our evaluation is based on a collection of synthetic microbenchmarks, mini-apps, and full-scale applications. It compares the scaling efficiencies of each benchmark between the dragonfly-based Frontier and the fat-tree-based Summit systems. Our results show that a dragonfly network is ∼30% more cost efficient than a fat-tree topology, which amortizes to ∼3% of an exascale system cost. Furthermore, while tapered dragonfly networks impose significant tradeoffs, the impacts are not as broad as initially thought and are mostly seen in applications with global communication patterns, particularly all-to-all (e.g., FFT-based algorithms), but also local communication patterns (e.g., nearest-neighbor algorithms) that are sensitive to network performance variability.
KW - Dragonfly & Fat-tree network topologies
KW - HPC systems
KW - network cost optimization
UR - http://www.scopus.com/inward/record.url?scp=85215000889&partnerID=8YFLogxK
U2 - 10.1109/SC41406.2024.00037
DO - 10.1109/SC41406.2024.00037
M3 - Conference contribution
AN - SCOPUS:85215000889
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2024
PB - IEEE Computer Society
T2 - 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024
Y2 - 17 November 2024 through 22 November 2024
ER -