Understanding failures through the lifetime of a top-level supercomputer

Elvis Rojas, Esteban Meneses, Terry Jones, Don Maxwell

Research output: Contribution to journal › Article › peer-review


Abstract

High performance computing systems are required to solve grand challenges in many scientific disciplines. These systems assemble many components to be powerful enough to solve extremely complex problems. An inherent consequence is the intricacy of the interactions among all those components, especially when failures come into the picture. To design reliable supercomputing platforms in the future, it is crucial to understand how these systems fail. This paper presents the results of studying multi-year failure and workload records of a powerful supercomputer that topped the world rankings. We provide a thorough analysis of the data and characterize the reliability of the system along several dimensions: failure classification, failure-rate modelling, and the interplay between failures and workload. The results shed light on the dynamics of top-level supercomputers and on sensitive areas ripe for improvement.

Original language: English
Pages (from-to): 27-41
Number of pages: 15
Journal: Journal of Parallel and Distributed Computing
Volume: 154
State: Published - Aug 2021

Funding

This research was partially supported by a machine allocation on the Kabré supercomputer at the Costa Rica National High Technology Center. Notice: This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders
Costa Rica National High Technology Center
U.S. Department of Energy

    Keywords

    • Failure analysis
    • Fault tolerance
    • High performance computing
    • Resilience
