TY - GEN
T1 - Understanding and analyzing interconnect errors and network congestion on a large scale HPC system
AU - Kumar, Mohit
AU - Gupta, Saurabh
AU - Patel, Tirthak
AU - Wilder, Michael
AU - Shi, Weisong
AU - Fu, Song
AU - Engelmann, Christian
AU - Tiwari, Devesh
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/7/19
Y1 - 2018/7/19
N2 - Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops due to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data of the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interactions among interconnect errors, network congestion, and application characteristics.
AB - Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops due to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data of the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interactions among interconnect errors, network congestion, and application characteristics.
KW - Cray
KW - Errors
KW - Gemini
KW - Interconnect
KW - Titan
UR - http://www.scopus.com/inward/record.url?scp=85051065059&partnerID=8YFLogxK
U2 - 10.1109/DSN.2018.00023
DO - 10.1109/DSN.2018.00023
M3 - Conference contribution
AN - SCOPUS:85051065059
T3 - Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
SP - 107
EP - 114
BT - Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
Y2 - 25 June 2018 through 28 June 2018
ER -