A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI

Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

22 Scopus citations

Abstract

Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores, but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: 1) traditional checkpoint-based approaches incur a steep overhead on failure-free operations, and 2) the dominant programming paradigm for parallel applications (the MPI standard) offers extremely limited support for software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high-quality implementation, as defined by the current MPI standard, to enable algorithm-based recovery without incurring the overhead of customary periodic checkpointing. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example.

Original language: English
Title of host publication: Parallel Processing - 18th International Conference, Euro-Par 2012, Proceedings
Pages: 477-488
Number of pages: 12
DOIs
State: Published - 2012
Externally published: Yes
Event: 18th International Conference on Parallel Processing, Euro-Par 2012 - Rhodes Island, Greece
Duration: Aug 27 2012 to Aug 31 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7484 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 18th International Conference on Parallel Processing, Euro-Par 2012
Country/Territory: Greece
City: Rhodes Island
Period: 08/27/12 to 08/31/12
