Correlated set coordination in fault tolerant message logging protocols

Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

26 Scopus citations

Abstract

Based on current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. Message logging, one of the most scalable checkpoint/restart techniques, is the most challenged when the number of cores per node increases, because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
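The core idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `needs_payload_log`, `logged_payload_bytes`, and the `node_of` mapping are hypothetical, and the sketch assumes that "correlated" simply means "hosted on the same node". It only shows how restricting payload logging to messages between independent processes reduces the logged volume.

```python
def needs_payload_log(src, dst, node_of):
    """Return True when the message payload must be logged.

    Assumption: processes on the same node fail together and recover in a
    coordinated way, so only messages crossing node boundaries are logged.
    """
    return node_of[src] != node_of[dst]


def logged_payload_bytes(messages, node_of):
    """Sum the sizes of the messages whose payload would be logged."""
    return sum(
        size
        for src, dst, size in messages
        if needs_payload_log(src, dst, node_of)
    )


# Hypothetical layout: ranks 0 and 1 share node "A", ranks 2 and 3 share node "B".
node_of = {0: "A", 1: "A", 2: "B", 3: "B"}

# (sender rank, receiver rank, payload size in bytes)
messages = [(0, 1, 100), (0, 2, 50), (3, 2, 70), (1, 3, 30)]

total = sum(size for _, _, size in messages)
logged = logged_payload_bytes(messages, node_of)
print(f"payload logged: {logged} of {total} bytes")
```

In this toy layout only the two inter-node messages are logged; the intra-node traffic is covered by coordinated recovery of the processes on that node.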

Original language: English
Title of host publication: Euro-Par 2011 Parallel Processing - 17th International Conference, Proceedings
Pages: 51-64
Number of pages: 14
Edition: PART 2
State: Published - 2011
Event: 17th International Conference on Parallel Processing, Euro-Par 2011 - Bordeaux, France
Duration: Aug 29 2011 to Sep 2 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 6853 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th International Conference on Parallel Processing, Euro-Par 2011
Country/Territory: France
City: Bordeaux
Period: 08/29/11 to 09/2/11
