Analyzing a five-year failure record of a leadership-class supercomputer

Elvis Rojas, Esteban Meneses, Terry Jones, Don Maxwell

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

12 Scopus citations

Abstract

Extreme-scale computing systems are required to solve some of the grand challenges in science and technology. From astrophysics to molecular biology, supercomputers are an essential tool to accelerate scientific discovery. However, large computing systems are prone to failures due to their complexity. To design reliable supercomputing platforms for the future, it is crucial to understand how these systems fail. This paper examines a five-year failure and workload record of a leadership-class supercomputer. To the best of our knowledge, five years represents the vast majority of the lifespan of a supercomputer. This is the first time such an analysis has been performed on a top 10 modern supercomputer. We performed a failure categorization and found that: i) most errors are GPU-related, with roughly 37% of them being double-bit errors on the cards; ii) failures are not evenly spread across the physical machine, with room temperature presumably playing a major role; and iii) system software errors bring down several nodes concurrently. Our failure rate analysis reveals that: i) the system consistently degrades, being at least twice as reliable at the beginning of the period as at the end; ii) a Weibull distribution closely fits the mean-time-between-failure data; and iii) hardware and software errors show markedly different patterns. Finally, we correlated failure and workload records to reveal that: i) failure and workload records are weakly correlated, except for certain types of failures when segmented by hour of the day; ii) several categories of failures make jobs crash within the first minutes of execution; and iii) a significant fraction of failed jobs exhaust the requested time regardless of when the failure occurred during execution.

Index Terms: Fault tolerance, resilience, failure analysis, high performance computing.

Original language: English
Title of host publication: Proceedings - 2019 31st International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2019
Publisher: IEEE Computer Society
Pages: 196-203
Number of pages: 8
ISBN (Electronic): 9781728141947
State: Published - Oct 2019
Event: 31st International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2019 - Campo Grande, Brazil
Duration: Oct 15 2019 to Oct 18 2019

Publication series

Name: Proceedings - Symposium on Computer Architecture and High Performance Computing
Volume: 2019-October
ISSN (Print): 1550-6533

Conference

Conference: 31st International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2019
Country/Territory: Brazil
City: Campo Grande
Period: 10/15/19 to 10/18/19

Funding

Notice: This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Acknowledgment: This research was partially supported by a machine allocation on the Kabré supercomputer at the Costa Rica National High Technology Center.

Funders:
UT-Battelle, LLC
National High Technology Center

    Keywords

    • Failure analysis
    • Fault tolerance
    • High performance computing
    • Resilience
