Models for Resilience Design Patterns

Mohit Kumar, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Scopus citations

Abstract

Resilience plays an important role in supercomputers by providing correct and efficient operation in the presence of faults, errors, and failures. Resilience design patterns offer blueprints for effectively applying resilience technologies. Prior work focused on developing initial efficiency and performance models for resilience design patterns. This paper extends that work by (1) describing performance, reliability, and availability models for all structural resilience design patterns, (2) providing more detailed models that include flowcharts and state diagrams, and (3) introducing the Resilience Design Pattern Modeling (RDPM) tool, which calculates and plots the performance, reliability, and availability metrics of individual patterns and pattern combinations.

Original language: English
Title of host publication: Proceedings of FTXS 2020
Subtitle of host publication: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 21-30
Number of pages: 10
ISBN (Electronic): 9781665422895
State: Published - Nov 2020
Event: 10th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2020 - Virtual, Atlanta, United States
Duration: Nov 11 2020 → …

Publication series

Name: Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 10th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2020
Country/Territory: United States
City: Virtual, Atlanta
Period: 11/11/20 → …

Funding

This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Early Career Program, with program managers Robinson Pino and Lucy Nowell. This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders (funder number):
U.S. Department of Energy
Office of Science
Advanced Scientific Computing Research (DE-AC05-00OR22725)

    Keywords

    • design patterns
    • high-performance computing
    • models
    • resilience
