Abstract
Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data of the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.
Original language | English |
---|---|
Title of host publication | Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 107-114 |
Number of pages | 8 |
ISBN (Electronic) | 9781538655955 |
State | Published - Jul 19 2018 |
Event | 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018 - Luxembourg City, Luxembourg Duration: Jun 25 2018 → Jun 28 2018 |
Conference
Conference | 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018 |
---|---|
Country/Territory | Luxembourg |
City | Luxembourg City |
Period | 06/25/18 → 06/28/18 |
Funding
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- Cray
- Errors
- Gemini
- Interconnect
- Titan