A scalable checkpoint encoding algorithm for diskless checkpointing

Zizhong Chen, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

17 Scopus citations

Abstract

Diskless checkpointing is an efficient technique for saving the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive k failures in p processes from 2⌈log p⌉·k((β + 2γ)m + α) to (1 + O(1/√m))·k(β + 2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate to perform calculations, and m is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of the introduced algorithm by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable.
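To make the two overhead expressions in the abstract concrete, the following minimal sketch evaluates both for a growing process count. It is an illustration only, not the authors' implementation: the function names, the base-2 logarithm, the sample parameter values, and the constant `c` standing in for the hidden factor in the O(1/√m) term are all assumptions.

```python
def baseline_overhead(p, k, m, alpha, beta, gamma):
    """Evaluate 2*ceil(log p) * k * ((beta + 2*gamma)*m + alpha).

    Assumes a base-2 logarithm; (p - 1).bit_length() equals ceil(log2(p))
    for every integer p >= 1 and avoids floating-point rounding.
    """
    ceil_log2_p = (p - 1).bit_length()
    return 2 * ceil_log2_p * k * ((beta + 2 * gamma) * m + alpha)


def scalable_overhead(k, m, beta, gamma, c=1.0):
    """Evaluate (1 + c/sqrt(m)) * k * (beta + 2*gamma) * m.

    c is a hypothetical constant replacing the O(1/sqrt(m)) term;
    the abstract does not give its value. Note that p does not appear.
    """
    return (1 + c / m ** 0.5) * k * (beta + 2 * gamma) * m


if __name__ == "__main__":
    # Hypothetical parameters: latency, inverse bandwidth, inverse compute
    # rate, checkpoint size in bytes, and number of tolerated failures.
    alpha, beta, gamma = 1e-5, 1e-9, 1e-10
    m, k = 10**8, 2
    for p in (64, 1024, 16384):
        base = baseline_overhead(p, k, m, alpha, beta, gamma)
        new = scalable_overhead(k, m, beta, gamma)
        print(f"p={p:6d}  baseline={base:.4f}s  scalable={new:.4f}s")
```

Under these assumed values, the first expression grows with ⌈log p⌉ while the second stays fixed as p increases, which is the sense of "scalable" used in the abstract.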

Original language: English
Title of host publication: Proceedings - 11th IEEE High Assurance Systems Engineering Symposium, HASE 2008
Pages: 71-79
Number of pages: 9
State: Published - 2008
Externally published: Yes
Event: 11th IEEE High Assurance Systems Engineering Symposium, HASE 2008 - Nanjing, China
Duration: Dec 3 2008 → Dec 5 2008

Publication series

Name: Proceedings of IEEE International Symposium on High Assurance Systems Engineering
ISSN (Print): 1530-2059

Conference

Conference: 11th IEEE High Assurance Systems Engineering Symposium, HASE 2008
Country/Territory: China
City: Nanjing
Period: 12/3/08 → 12/5/08

Keywords

  • Checkpoint
  • Diskless checkpointing
  • Fault tolerance
  • High performance computing
  • Parallel and distributed systems
  • Reed-Solomon encoding
