Abstract
Initial versions of MPI were designed to work efficiently on multiprocessors that had very little job control and thus static process models. Subsequently forcing them to support a dynamic process model would have affected their performance. As current HPC systems increase in size, with greater potential levels of individual node failure, the need arises for new fault-tolerant systems to be developed. Here we present a new implementation of MPI called fault tolerant MPI (FT-MPI) that allows the semantics and associated modes of failure to be explicitly controlled by an application via a modified MPI API. We give an overview of the FT-MPI semantics, design, example applications, debugging tools and some performance issues. We also discuss the experimental HARNESS core (G_HCORE) implementation that FT-MPI is built to operate upon.
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 1479-1495 |
Number of pages | 17 |
Journal | Parallel Computing |
Volume | 27 |
Issue number | 11 |
State | Published - Oct 2001 |
Externally published | Yes |
Keywords
- Fault tolerant application
- Message passing
- Metacomputing middleware
- Parallel computing