Scalable fault tolerant MPI: Extending the recovery algorithm

Graham E. Fagg, Thara Angskun, George Bosilca, Jelena Pjesivac-Grbovic, Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

4 Scopus citations

Abstract

Fault Tolerant MPI (FT-MPI) [6] was designed to give applications different methods for handling process failures beyond simple checkpoint-restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system state recovery algorithm designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation, although robust, were very conservative, and this affected scalability both on very large clusters and on distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms that aim to be both scalable and latency tolerant. Our conclusions show that using topology-aware collective communication together with distributed consensus algorithms produces the best results.

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 67-75
Number of pages: 9
State: Published - 2005
Externally published: Yes
Event: 12th European PVM/MPI Users' Group Meeting - Recent Advances in Parallel Virtual Machine and Message Passing Interface - Sorrento, Italy
Duration: Sep 18 2005 to Sep 21 2005

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3666 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 12th European PVM/MPI Users' Group Meeting - Recent Advances in Parallel Virtual Machine and Message Passing Interface
Country/Territory: Italy
City: Sorrento
Period: 09/18/05 to 09/21/05
