HARNESS fault tolerant MPI design, usage and performance issues

Graham E. Fagg, Jack J. Dongarra

Research output: Contribution to journal › Article › peer-review


Abstract

Initial versions of MPI were designed to work efficiently on multiprocessors that offered very little job control and thus assumed a static process model. Forcing those implementations to support the dynamic process model needed on clusters or distributed systems would have reduced their performance. As current HPC collaborative applications increase in size and distribution, the potential for node and network failures also increases. This is especially true when MPI implementations are used as the communication medium for Grid applications, where the Grid architectures themselves are inherently unreliable, so new fault tolerant MPI systems need to be developed. Here we present a new implementation of MPI, called FT-MPI, that allows the semantics and associated modes of failure to be explicitly controlled by an application via a modified MPI API. We give an overview of the FT-MPI semantics, design, and example applications, and discuss performance issues such as efficient group communications and complex data handling. We also briefly describe the HARNESS g_hcore system, which handles low-level system operations on behalf of the MPI implementation, including the plug-in services developed and their interaction with the FT-MPI runtime library.

Original language: English
Pages (from-to): 1127-1142
Number of pages: 16
Journal: Future Generation Computer Systems
Volume: 18
Issue number: 8
DOIs
State: Published - Oct 2002
Externally published: Yes

Keywords

  • FT-MPI
  • Fault tolerant message passing
  • HARNESS
  • MPI implementation
  • Meta computing
