Pattern-based modeling of high-performance computing resilience

Saurabh Hukerikar, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Scopus citation

Abstract

With the growing scale and complexity of high-performance computing (HPC) systems, resilience solutions that ensure continuity of service despite frequent errors and component failures must be methodically designed to balance the reliability requirements with the overheads to performance and power. Design patterns enable a structured approach to the development of resilience solutions, providing hardware and software designers with the building block elements for the rapid development of novel solutions and for adapting existing technologies for emerging, extreme-scale HPC environments. In this paper, we develop analytical models that enable designers to evaluate the reliability and performance characteristics of the design patterns. These models are particularly useful in building a unified framework that analyzes and compares various resilience solutions built using a combination of patterns.

Original language: English
Title of host publication: Euro-Par 2017
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2017 International Workshops
Editors: Dora B. Heras, Luc Bouge
Publisher: Springer Verlag
Pages: 557-568
Number of pages: 12
ISBN (Print): 9783319751771
State: Published - 2018
Event: International Workshops on Parallel Processing, Euro-Par 2017 - Santiago de Compostela, Spain
Duration: Aug 28 2017 → Aug 29 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10659 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: International Workshops on Parallel Processing, Euro-Par 2017
Country/Territory: Spain
City: Santiago de Compostela
Period: 08/28/17 → 08/29/17

Funding

This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). Acknowledgements. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC05-00OR22725.

Keywords

  • High-performance computing
  • Modeling
  • Patterns
  • Performance
  • Reliability
  • Resilience
