Understanding and analyzing interconnect errors and network congestion on a large scale HPC system

Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, Devesh Tiwari

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

11 Scopus citations

Abstract

Today's High Performance Computing (HPC) systems are capable of delivering performance in the order of petaflops due to the fast computing devices, network interconnect, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on the overall interconnect and application performance. This is especially true for scientific applications running multiple processes on different compute nodes as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data of the Titan supercomputer to develop a thorough understanding of interconnects faults, errors and congestion events. We also study the interaction between interconnect, errors, network congestion and application characteristics.

Original languageEnglish
Title of host publicationProceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages107-114
Number of pages8
ISBN (Electronic)9781538655955
DOIs
StatePublished - Jul 19 2018
Event48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018 - Luxembourg City, Luxembourg
Duration: Jun 25 2018Jun 28 2018

Publication series

NameProceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018

Conference

Conference48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
Country/TerritoryLuxembourg
CityLuxembourg City
Period06/25/1806/28/18

Funding

This manuscript has been authored by UT-Battelle,LLC under Contract No. DE-AC05-00OR22725with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

FundersFunder number
Office of AdvancedScientific Computing Research
National Science Foundation1563750, 1561216, 1563728
U.S. Department of EnergyDE-AC05-00OR22725
Office of Science
North-Eastern Hill University

    Keywords

    • Cray
    • Errors
    • Gemini
    • Interconnect
    • Titan

    Fingerprint

    Dive into the research topics of 'Understanding and analyzing interconnect errors and network congestion on a large scale HPC system'. Together they form a unique fingerprint.

    Cite this