Toward local failure local recovery resilience model using MPI-ULFM

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

51 Scopus citations

Abstract

The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become unfeasible for future extreme scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR) that provides application developers with the ability to recover locally and continue application execution when a process is lost. We discuss the design of our software framework to enable the LFLR model using MPI-ULFM and demonstrate the resilient version of MiniFE that achieves a scalable recovery from process failures.

Original language: English
Title of host publication: Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014
Publisher: Association for Computing Machinery
Pages: 51-56
Number of pages: 6
ISBN (Electronic): 9781450328753
State: Published - Sep 9 2014
Externally published: Yes
Event: 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014 - Kyoto, Japan
Duration: Sep 9 2014 → Sep 12 2014

Publication series

Name: ACM International Conference Proceeding Series
Volume: 09-12-September-2014

Conference

Conference: 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014
Country/Territory: Japan
City: Kyoto
Period: 09/9/14 → 09/12/14

Funding

The authors would like to thank George Bosilca, Robert Clay, Pedro Diniz and Mark Hoemmen for interesting discussions related to this work. This work was supported by the U.S. Department of Energy (DOE) National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Keywords

  • Fault Tolerance
  • MPI
  • PDE solvers
  • Scientific Computing
  • User Level Fault Mitigation
