Toward exascale resilience: 2014 update

Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, Marc Snir

Research output: Contribution to journal › Article › peer-review

241 Scopus citations

Abstract

Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict errors will happen in exascale systems many times per day. These errors will propagate and generate various kinds of malfunctions, from simple process crashes to result corruptions. The past five years have seen extraordinary technical progress in many domains related to exascale resilience. Several technical options, initially considered inapplicable or unrealistic in the HPC context, have demonstrated surprising successes. Despite this progress, the exascale resilience problem is not solved, and the community is still facing the difficult challenge of ensuring that exascale applications complete and generate correct results while running on unstable systems. Since 2009, many workshops, studies, and reports have improved the definition of the resilience problem and provided refined recommendations. Some projections made during the previous decades and some priorities established from these projections need to be revised. This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.

Original language: English
Pages (from-to): 4-27
Number of pages: 24
Journal: Supercomputing Frontiers and Innovations
Volume: 1
Issue number: 1
DOIs
State: Published - 2014

Funding

We thank Esteban Meneses for his help in the software section. This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357, and under Award DESC0004131. This work was also supported in part by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy award DE-FG02-13ER26138/DE-SC0010049. This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (award number ACI 1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ('Argonne'). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.

Funders (funder number, where available):
U.S. Department of Energy Office of Science Laboratory
National Science Foundation: ACI 1238993
U.S. Department of Energy
University of Illinois at Urbana-Champaign
Office of Science
Advanced Scientific Computing Research: DESC0004131, DE-AC02-06CH11357, DE-FG02-13ER26138/DE-SC0010049
Argonne National Laboratory
University of Chicago
University of Illinois
National Centre for Supercomputing Applications
National Science Foundation

Keywords

• Exascale
• Fault-tolerance techniques
• Resilience
