RDPM: An Extensible Tool for Resilience Design Patterns Modelling

Mohit Kumar, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Resilience to faults, errors, and failures in extreme-scale high-performance computing (HPC) systems is a critical challenge. Resilience design patterns offer a new, structured hardware and software design approach for improving resilience. Prior work focused on developing performance, reliability, and availability models for resilience design patterns. This paper extends that work with a Resilience Design Patterns Modeling (RDPM) tool that allows users to (1) explore the performance, reliability, and availability of each resilience design pattern, (2) customize parameters to optimize performance, reliability, and availability, and (3) investigate trade-off models for combining multiple patterns into practical resilience solutions.

Original language: English
Title of host publication: Euro-Par 2021
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2021 International Workshops, 2021, Revised Selected Papers
Editors: Ricardo Chaves, Dora B. Heras, Aleksandar Ilic, Didem Unat, Rosa M. Badia, Andrea Bracciali, Patrick Diehl, Anshu Dubey, Oh Sangyoon, Stephen L. Scott, Laura Ricci
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 283-297
Number of pages: 15
ISBN (Print): 9783031061554
State: Published - 2022
Event: 27th International Conference on Parallel and Distributed Computing, Euro-Par 2021 - Virtual, Online
Duration: Aug 30 2021 - Aug 31 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13098 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 27th International Conference on Parallel and Distributed Computing, Euro-Par 2021
City: Virtual, Online
Period: 08/30/21 - 08/31/21

Funding

Acknowledgements. This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders                                   Funder number
U.S. Department of Energy
Office of Science
Advanced Scientific Computing Research    DE-AC05-00OR22725

Keywords

• Design patterns
• High-performance computing
• Resilience
• Tool
