Performance efficient multiresilience using checkpoint recovery in iterative algorithms

Rizwan A. Ashraf, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this paper, we address the design challenge of building multiresilient iterative high-performance computing (HPC) applications. Multiresilience in HPC applications is the ability to tolerate both soft errors and process failures while maintaining forward progress. We address the challenge by proposing performance models that help in designing performance-efficient and resilient iterative applications. The models consider the interaction between soft error and process failure resilience solutions. We experimented with a linear solver application and two distinct kinds of soft error detectors: one has high overhead and high accuracy, whereas the other has low overhead and low accuracy. We show how both can be leveraged to verify the integrity of the checkpointed state used to recover from both soft errors and process failures. Our results show the performance efficiency and resiliency benefit of employing the low-overhead detector at high frequency within the checkpoint interval, so that soft error recovery can take place promptly, resulting in less re-computed work.
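The closing claim of the abstract, that checking more often within a checkpoint interval reduces re-computed work, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the Jacobi solver, the tridiagonal test system, the residual-growth check that stands in for a low-overhead, low-accuracy detector, the in-memory checkpoint, and the names `solve`, `ckpt_interval`, `detect_interval`, and `fault_at` are all hypothetical. The paper's actual detectors and performance models are not reproduced here.

```python
import math


def tridiagonal(n):
    """Hypothetical test system: symmetric, strictly diagonally dominant."""
    return [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


def residual_norm(A, b, x):
    """2-norm of the residual b - A x."""
    n = len(b)
    return math.sqrt(sum((b[i] - sum(A[i][j] * x[j] for j in range(n))) ** 2
                         for i in range(n)))


def jacobi_step(A, b, x):
    """One Jacobi sweep; returns a new iterate."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]


def solve(A, b, iters, ckpt_interval, detect_interval, fault_at=None):
    """Run Jacobi with in-memory checkpoints and a cheap residual check.

    Returns (solution, number of iterations that had to be re-computed).
    """
    x = [0.0] * len(b)
    ckpt_k, ckpt_x, ckpt_norm = 0, list(x), residual_norm(A, b, x)
    last_norm = ckpt_norm
    recomputed = 0
    k = 0
    while k < iters:
        x = jacobi_step(A, b, x)
        k += 1
        if k == fault_at:
            x[0] += 1.0e6      # assumed one-shot injected soft error
            fault_at = None
        at_ckpt = k % ckpt_interval == 0
        if at_ckpt or k % detect_interval == 0:
            norm = residual_norm(A, b, x)
            # Assumed detector: flag a large jump in the residual norm.
            if not math.isfinite(norm) or norm > 10.0 * last_norm + 1e-9:
                recomputed += k - ckpt_k
                x, k, last_norm = list(ckpt_x), ckpt_k, ckpt_norm
                continue       # roll back to the last verified checkpoint
            last_norm = norm
            if at_ckpt:        # state passed the check, so checkpoint it
                ckpt_k, ckpt_x, ckpt_norm = k, list(x), norm
    return x, recomputed


A, b = tridiagonal(8), [1.0] * 8
clean, redo_clean = solve(A, b, 40, 20, 20)
x_sparse, redo_sparse = solve(A, b, 40, 20, 20, fault_at=13)
x_dense, redo_dense = solve(A, b, 40, 20, 5, fault_at=13)
print("re-computed iterations:", redo_sparse, "vs", redo_dense)
```

In this toy setup the state is checked before every checkpoint, so a corrupted iterate is never saved, and running the check every 5 iterations instead of only at the 20-iteration checkpoint catches the injected error sooner and discards fewer iterations. The real trade-off also depends on detector cost and accuracy, which this sketch does not model.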

Original language: English
Title of host publication: Euro-Par 2018
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2018 International Workshops, Revised Selected Papers
Editors: Gabriele Mencagli, Dora B. Heras
Publisher: Springer Verlag
Pages: 813-825
Number of pages: 13
ISBN (Print): 9783030105488
State: Published - 2019
Event: 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018 - Turin, Italy
Duration: Aug 27 2018 → Aug 28 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11339 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018
Country/Territory: Italy
City: Turin
Period: 08/27/18 → 08/28/18

Funding

This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Acknowledgements. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC05-00OR22725.

Keywords

  • Analytical models
  • Checkpoint restart
  • Design patterns
  • Fault injection
  • High-performance computing
  • Iterative algorithms
  • Linear solver
  • Performance
  • Process failures
  • Resilience
  • Soft errors
