Correlated set coordination in fault tolerant message logging protocols for many-core clusters

Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack J. Dongarra

Research output: Contribution to journal › Article › peer-review

15 Scopus citations

Abstract

With current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint/restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes.
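The split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `node_of`, `same_correlated_set`, and `send` are hypothetical, and the sketch assumes that a correlated set is simply the set of ranks hosted on the same node. Every message still produces an event record, but the payload is copied into the sender's log only when the destination lies outside the sender's correlated set.

```python
# Illustrative sketch only; all names here are hypothetical, not from the paper.

def same_correlated_set(src, dst, node_of):
    """Assume two ranks are correlated when they run on the same node."""
    return node_of[src] == node_of[dst]

def send(src, dst, payload, node_of, event_log, payload_log):
    """Record the event for every message; log the payload only across sets."""
    event_log.append((src, dst))  # event logging is kept for all messages
    if not same_correlated_set(src, dst, node_of):
        # Independent processes: keep a copy so the message can be replayed.
        payload_log.append((src, dst, payload))
        return True
    # Correlated processes recover together, so no payload copy is kept.
    return False

# Hypothetical layout: ranks 0 and 1 share node "n0", rank 2 is on node "n1".
node_of = {0: "n0", 1: "n0", 2: "n1"}
event_log, payload_log = [], []
intra = send(0, 1, b"halo", node_of, event_log, payload_log)
inter = send(0, 2, b"halo", node_of, event_log, payload_log)
```

In this layout only the message from rank 0 to rank 2 adds an entry to `payload_log`, while both messages appear in `event_log`.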

Original language: English
Pages (from-to): 572-585
Number of pages: 14
Journal: Concurrency and Computation: Practice and Experience
Volume: 25
Issue number: 4
State: Published - Feb 2013
Externally published: Yes

Keywords

  • checkpoint/restart
  • fault tolerance
  • multicore clusters
