Multi-criteria checkpointing strategies: Response-time versus resource utilization

Aurelien Bouteiller, Franck Cappello, Jack Dongarra, Amina Guermouche, Thomas Hérault, Yves Robert

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Scopus citations

Abstract

Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One reason for this unnerving situation is the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to advance a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time of uncoordinated checkpointing is unchanged, while it delivers near-perfect platform efficiency.

Original language: English
Title of host publication: Euro-Par 2013 Parallel Processing - 19th International Conference, Proceedings
Pages: 420-431
Number of pages: 12
State: Published - 2013
Externally published: Yes
Event: 19th International Conference on Parallel Processing, Euro-Par 2013 - Aachen, Germany
Duration: Aug 26 2013 → Aug 30 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8097 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 19th International Conference on Parallel Processing, Euro-Par 2013
Country/Territory: Germany
City: Aachen
Period: 08/26/13 → 08/30/13
