Abstract
The US Department of Energy deployed the Summit and Sierra supercomputers with state-of-the-art network interconnect technology in 2018, and both systems entered production in 2019. In this paper, we provide an in-depth assessment of the systems' network interconnects, which are based on Enhanced Data Rate (EDR) 100 Gb/s Mellanox InfiniBand. Both systems use second-generation EDR Host Channel Adapters (HCAs) and switches with several new features, such as Adaptive Routing (AR), switch-based collectives, and HCA-based tag matching. Although the two networks are built from the same components, Summit's network is "non-blocking" (i.e., a fully provisioned Clos network), whereas Sierra's network has a 2:1 taper between the racks and aggregation switches. We evaluate the two systems' interconnects using traditional communication benchmarks as well as production applications. We find that the new Adaptive Routing dramatically improves performance, but the other new features still need improvement.
Original language | English |
---|---|
Title of host publication | Proceedings of SC 2019 |
Subtitle of host publication | The International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | IEEE Computer Society |
ISBN (Electronic) | 9781450362290 |
State | Published - Nov 17 2019 |
Event | 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019 - Denver, United States. Duration: Nov 17 2019 → Nov 22 2019 |
Publication series
Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
---|---|
ISSN (Print) | 2167-4329 |
ISSN (Electronic) | 2167-4337 |
Conference
Conference | 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019 |
---|---|
Country/Territory | United States |
City | Denver |
Period | 11/17/19 → 11/22/19 |
Funding
This work was performed under the auspices of the U.S. DOE by Oak Ridge Leadership Computing Facility at ORNL under contract DE-AC05-00OR22725. The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- Bandwidth
- Congestion
- EDR
- High performance computing
- InfiniBand
- Interconnect
- Latency
- Offload
- Switch collectives
- Tag matching