Fault tolerant MPI for the HARNESS Meta-computing system

Graham E. Fagg, Antonin Bukovsky, Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Initial versions of MPI were designed to work efficiently on multiprocessors which had very little job control and thus static process models. Subsequently forcing them to support a dynamic process model suitable for use on clusters or distributed systems would have reduced their performance. As current HPC collaborative applications increase in size and distribution, the potential for node and network failures increases, and the need arises for new fault-tolerant systems. Here we present a new implementation of MPI called FT-MPI that allows the semantics and associated modes of failures to be explicitly controlled by an application via a modified MPI API. We give an overview of the FT-MPI semantics, design, and example applications, and discuss some performance issues such as efficient group communications and complex data handling.

Original language: English
Title of host publication: Computational Science - ICCS 2001 - International Conference, 2001, Proceedings
Editors: Vassil N. Alexandrov, Jack J. Dongarra, Benjoe A. Juliano, René S. Renner, C.J. Kenneth Tan
Publisher: Springer Verlag
Pages: 355-366
Number of pages: 12
ISBN (Print): 3540422323, 9783540422327
State: Published - 2001
Externally published: Yes
Event: International Conference on Computational Science, ICCS 2001 - San Francisco, United States
Duration: May 28 2001 → May 30 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2073
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: International Conference on Computational Science, ICCS 2001
Country/Territory: United States
City: San Francisco
Period: 05/28/01 → 05/30/01
