A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI

Joshua Hursey, Thomas Naughton, Geoffroy Vallee, Richard L. Graham

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

17 Scopus citations

Abstract

The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
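As a rough illustration of the idea in the abstract, the sketch below simulates a two-phase agreement over a binary tree in which the reduction is folded into the messages of the first phase, so no extra messages are needed. It is not the authors' algorithm: the function name `agree`, the choice of bitwise AND as the reduction, the binary tree built over the surviving ranks, and the assumption that failures are known before the round starts are all hypothetical simplifications, and the error handling described in the paper is not modeled.

```python
def agree(flags, failed=frozenset()):
    """Simulate one agreement round over a binary tree of the live ranks.

    flags  -- dict mapping rank -> integer contribution (hypothetical input)
    failed -- set of ranks assumed to be known-failed before the round starts

    Returns (decided, messages): the value each live rank ends up with and
    the number of point-to-point messages the simulated round used.
    """
    alive = sorted(r for r in flags if r not in failed)
    if not alive:
        return {}, 0

    n = len(alive)
    # Tree position i holds rank alive[i]; the parent of position i is (i - 1) // 2.
    partial = [flags[r] for r in alive]
    messages = 0

    # Phase 1 (gather): each non-root position sends its partial result to
    # its parent. Iterating from the highest position down guarantees that a
    # node has absorbed both children before it sends. The reduction
    # (bitwise AND here, an assumption) rides on these same messages.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] &= partial[i]
        messages += 1

    decision = partial[0]

    # Phase 2 (broadcast): the decision travels back down the tree, one
    # message per non-root position.
    decided = {alive[0]: decision}
    for i in range(1, n):
        decided[alive[i]] = decision
        messages += 1

    return decided, messages


if __name__ == "__main__":
    flags = {0: 0b111, 1: 0b110, 2: 0b011, 3: 0b111}
    print(agree(flags))
    print(agree(flags, failed={2}))
```

In this sketch a round over m live ranks uses 2(m - 1) messages, and the tree depth, and therefore the number of communication steps per phase, grows logarithmically with m.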

Original language: English
Title of host publication: Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011, Proceedings
Pages: 255-263
Number of pages: 9
State: Published - 2011
Event: 18th European Message Passing Interface Users' Group Meeting, EuroMPI 2011 - Santorini, Greece
Duration: Sep 18 2011 to Sep 21 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 6960 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 18th European Message Passing Interface Users' Group Meeting, EuroMPI 2011
Country/Territory: Greece
City: Santorini
Period: 09/18/11 to 09/21/11

Funding

Acknowledgments. Research sponsored by the Mathematical, Information, and Computational Sciences Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

Funders: U.S. Department of Energy; Advanced Scientific Computing Research

    Keywords

    • Agreement Protocol
    • Algorithm Based Fault Tolerance
    • Fault Tolerance
    • MPI
    • Run-through Stabilization
