Building and using a fault-tolerant MPI implementation

Graham E. Fagg, Jack J. Dongarra

Research output: Contribution to journal › Article › peer-review

32 Scopus citations

Abstract

In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process failures in ways that go beyond the static process model of the original MPI. FT-MPI lets an application explicitly control the semantics and associated failure modes through modified functionality within the standard MPI 1.2 API. We give an overview of the FT-MPI semantics, architecture design, example usage, and sample applications. We also briefly discuss the consequences of designing a fault-tolerant MPI, both in terms of how such an implementation handles failures internally at multiple levels and how existing applications can use the new features while remaining within the MPI standard.

Original language: English
Pages (from-to): 353-361
Number of pages: 9
Journal: International Journal of High Performance Computing Applications
Volume: 18
Issue number: 3
State: Published - Sep 2004
Externally published: Yes

Keywords

  • Fault tolerant
  • MPI
  • Message passing
  • Parallel computing
