A pattern language for high-performance computing resilience

Saurabh Hukerikar, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

3 Scopus citations

Abstract

High-performance computing (HPC) systems provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance on the order of a quadrillion floating-point arithmetic calculations per second by aggregating the power of millions of compute, memory, networking and storage components. With the rapidly growing scale and complexity of HPC systems for achieving even greater performance, ensuring their reliable operation in the face of system degradations and failures is a critical challenge. System fault events often lead scientific applications to produce incorrect results, or may even cause their untimely termination. The sheer number of components in modern extreme-scale HPC systems and the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment make the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a methodology for designing HPC resilience solutions using design patterns. We codified the well-known techniques for handling faults, errors and failures that have been devised, applied and improved upon over the past three decades in the form of design patterns. In this paper, we present a pattern language to enable a structured approach to the development of HPC resilience solutions. The pattern language reveals the relations among the resilience patterns and provides the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack.

Original language: English
Title of host publication: Proceedings of the 22nd European Conference on Pattern Languages of Programs, EuroPLoP 2017
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450348485
DOI: 10.1145/3147704.3147718
State: Published - Jul 12 2017
Event: 22nd European Conference on Pattern Languages of Programs, EuroPLoP 2017 - Irsee, Germany
Duration: Jul 12 2017 to Jul 16 2017

Publication series

Name: ACM International Conference Proceeding Series
Volume: Part F132091

Conference

Conference: 22nd European Conference on Pattern Languages of Programs, EuroPLoP 2017
Country/Territory: Germany
City: Irsee
Period: 07/12/17 to 07/16/17

Funding

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC05-00OR22725. We thank our shepherd Klaus Marquardt for his comments and suggestions that greatly improved the manuscript.

Saurabh Hukerikar, Christian Engelmann. 2017. A Pattern Language for High-Performance Computing Resilience. EuroPLoP (July 2017), 16 pages. DOI: 10.1145/3147704.3147718

This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders (funder number)
U.S. Department of Energy
Office of Science
Advanced Scientific Computing Research (DE-AC05-00OR22725)

    Keywords

    • Design patterns
    • Fault tolerance
    • High-performance computing
    • Resilience
