Abstract
The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault-tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault-tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
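The idea of folding a reduction into a tree-based two-phase agreement can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's algorithm: the function name `tree_agreement`, the binary-tree layout, the bitwise-AND default operator, and the simplification that failed ranks are known before the call are all hypothetical choices made for illustration. It shows that the partial reduction results travel on the same child-to-parent vote messages, so the message count stays at 2(n - 1) for n live ranks.

```python
def tree_agreement(votes, failed=frozenset(), op=lambda a, b: a & b):
    """Simulate a two-phase agreement over a binary tree of live ranks.

    votes  : list of per-rank integer contributions (index = rank)
    failed : ranks assumed to be known-failed before the call (an assumption
             of this sketch; a real protocol must detect failures on the fly)
    op     : associative, commutative reduction operator
    Returns (decisions, messages): the decided value for each live rank and
    the number of point-to-point messages the simulation counted.
    """
    live = [r for r in range(len(votes)) if r not in failed]
    if not live:
        raise ValueError("no live ranks")

    messages = 0
    partial = {}

    # Phase 1 (vote + reduce): visit tree positions bottom-up so that every
    # child has produced its partial result before its parent combines it.
    for pos in reversed(range(len(live))):
        value = votes[live[pos]]
        for child in (2 * pos + 1, 2 * pos + 2):
            if child < len(live):
                value = op(value, partial[child])
                messages += 1  # child -> parent vote carrying the partial result
        partial[pos] = value

    # Phase 2 (commit): the root's reduced value is sent back down the tree.
    decided = partial[0]
    decisions = {}
    for pos in range(len(live)):
        decisions[live[pos]] = decided
        if pos > 0:
            messages += 1  # parent -> child commit message
    return decisions, messages


if __name__ == "__main__":
    decisions, messages = tree_agreement([0b111, 0b101, 0b110, 0b011], failed={3})
    print(decisions, messages)
```

In this sketch the tree depth grows logarithmically with the number of live ranks, and every live rank ends up with the same reduced value. Handling failures that occur during the protocol, which the paper's error handling mechanisms address, is deliberately left out.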
| Original language | English |
|---|---|
| Title of host publication | Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011, Proceedings |
| Pages | 255-263 |
| Number of pages | 9 |
| State | Published - 2011 |
| Event | 18th European Message Passing Interface Users' Group Meeting, EuroMPI 2011 - Santorini, Greece. Duration: Sep 18 2011 → Sep 21 2011 |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 6960 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 18th European Message Passing Interface Users' Group Meeting, EuroMPI 2011 |
|---|---|
| Country/Territory | Greece |
| City | Santorini |
| Period | 09/18/11 → 09/21/11 |
Funding
Acknowledgments. Research sponsored by the Mathematical, Information, and Computational Sciences Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.
Keywords
- Agreement Protocol
- Algorithm Based Fault Tolerance
- Fault Tolerance
- MPI
- Run-through Stabilization