HARNESS and fault tolerant MPI

Graham E. Fagg, Antonin Bukovsky, Jack J. Dongarra

Research output: Contribution to journal › Article › peer-review

46 Scopus citations

Abstract

Initial versions of MPI were designed to work efficiently on multiprocessors that had very little job control and thus static process models. Forcing them later to support a dynamic process model would have affected their performance. As current HPC systems grow in size, and with them the potential for individual node failures, new fault-tolerant systems are needed. Here we present a new implementation of MPI called fault tolerant MPI (FT-MPI) that allows an application to explicitly control the semantics and associated modes of failure via a modified MPI API. We give an overview of the FT-MPI semantics, design, example applications, debugging tools and some performance issues. We also discuss the experimental HARNESS core (G_HCORE) implementation on which FT-MPI is built to operate.

Original language: English
Pages (from-to): 1479-1495
Number of pages: 17
Journal: Parallel Computing
Volume: 27
Issue number: 11
State: Published - Oct 2001
Externally published: Yes

Keywords

  • Fault tolerant application
  • Message passing
  • Metacomputing middleware
  • Parallel computing
