Abstract
The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault-tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault-tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
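The idea of piggybacking a reduction on a tree-based two-phase agreement can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the algorithm from the paper: the function name `tree_agreement`, the binary-tree layout, and the bitwise-AND default for the reduction are all hypothetical choices, failed ranks are assumed to be known before the protocol starts, and no failures occur while it runs (so the paper's error handling is not modeled).

```python
def tree_agreement(votes, failed=frozenset(), combine=lambda a, b: a & b):
    """Simulate a two-phase agreement over a binary tree of live ranks.

    votes[r] is rank r's local contribution (for example, a flag).
    failed is a set of ranks assumed to be known-failed; they are skipped.
    Returns (decisions, messages): the agreed value per live rank and the
    number of point-to-point messages the simulated protocol would send.
    """
    alive = [r for r in range(len(votes)) if r not in failed]
    if not alive:
        return {}, 0
    messages = 0

    def gather(i):
        # Phase 1: each node combines its own vote with its children's
        # partial results; the reduction rides on the vote messages.
        nonlocal messages
        value = votes[alive[i]]
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(alive):
                value = combine(value, gather(child))
                messages += 1  # one child -> parent vote message
        return value

    decision = gather(0)
    # Phase 2: the root's decision travels back down the same tree,
    # one parent -> child commit message per non-root node.
    messages += len(alive) - 1
    return {rank: decision for rank in alive}, messages
```

In this sketch every live rank ends up with the same reduced value, the message count is 2(n - 1) for n live ranks, and the tree depth grows logarithmically with n. The reduction adds no messages beyond those the two-phase commit already sends.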
Original language | English |
---|---|
Title of host publication | Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011, Proceedings |
Pages | 255-263 |
Number of pages | 9 |
State | Published - 2011 |
Event | 18th European Message Passing Interface Users' Group Meeting, EuroMPI 2011 - Santorini, Greece. Duration: Sep 18 2011 → Sep 21 2011 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 6960 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 18th European Message Passing Interface Users' Group Meeting, EuroMPI 2011 |
---|---|
Country/Territory | Greece |
City | Santorini |
Period | 09/18/11 → 09/21/11 |
Funding
Acknowledgments. Research sponsored by the Mathematical, Information, and Computational Sciences Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.
Keywords
- Agreement Protocol
- Algorithm Based Fault Tolerance
- Fault Tolerance
- MPI
- Run-through Stabilization