Process fault tolerance: Semantics, design and applications for high performance computing

Graham E. Fagg, Edgar Gabriel, Zizhong Chen, Thara Angskun, George Bosilca, Jelena Pjesivac-Grbovic, Jack J. Dongarra

Research output: Contribution to journal › Article › peer-review

32 Scopus citations

Abstract

As the number of processors in current machines grows, so does the probability of node or link failures. Application-level fault tolerance is therefore becoming an increasingly important issue for both end-users and the institutions running the machines. In this paper we present the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. We present the architecture of fault-tolerant MPI, an implementation of MPI using these semantics, as well as benchmark results with various applications. We further detail an example of a fault-tolerant parallel equation solver, performance results, and the time required to recover from a process failure.

Original language: English
Pages (from-to): 465-477
Number of pages: 13
Journal: International Journal of High Performance Computing Applications
Volume: 19
Issue number: 4
State: Published - Dec 2005
Externally published: Yes

Keywords

  • Fault tolerance
  • MPI and message passing
  • Parallel computing
