TY - JOUR
T1 - Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system
AU - Kumar, Mohit
AU - Gupta, Saurabh
AU - Patel, Tirthak
AU - Wilder, Michael
AU - Shi, Weisong
AU - Fu, Song
AU - Engelmann, Christian
AU - Tiwari, Devesh
N1 - Publisher Copyright:
© 2021 Elsevier Inc.
PY - 2021/7
Y1 - 2021/7
N2 - Today's High Performance Computing (HPC) systems contain thousands of nodes which work together to provide performance on the order of petaflops. The performance of these systems depends on various components like processors, memory, and interconnect. Among these, the interconnect plays a major role as it glues together all the hardware components in an HPC system. A slow interconnect can severely impact a scientific application running on multiple processes, as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks a study that explores different interconnect errors, congestion events, and application characteristics on a large-scale HPC system. In our previous work, we processed and analyzed interconnect data of the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. In this work, we first show how congestion events can impact application performance. We then investigate how application characteristics interact with interconnect errors and network congestion to predict applications encountering congestion with more than 90% accuracy.
AB - Today's High Performance Computing (HPC) systems contain thousands of nodes which work together to provide performance on the order of petaflops. The performance of these systems depends on various components like processors, memory, and interconnect. Among these, the interconnect plays a major role as it glues together all the hardware components in an HPC system. A slow interconnect can severely impact a scientific application running on multiple processes, as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks a study that explores different interconnect errors, congestion events, and application characteristics on a large-scale HPC system. In our previous work, we processed and analyzed interconnect data of the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. In this work, we first show how congestion events can impact application performance. We then investigate how application characteristics interact with interconnect errors and network congestion to predict applications encountering congestion with more than 90% accuracy.
KW - Cray
KW - Errors
KW - Gemini
KW - Interconnect
KW - Titan
UR - https://www.scopus.com/pages/publications/85103622930
U2 - 10.1016/j.jpdc.2021.03.001
DO - 10.1016/j.jpdc.2021.03.001
M3 - Article
AN - SCOPUS:85103622930
SN - 0743-7315
VL - 153
SP - 29
EP - 43
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
ER -