Abstract
The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become infeasible for future extreme-scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR), which gives application developers the ability to recover locally and continue application execution when a process is lost. We discuss the design of our software framework to enable the LFLR model using MPI-ULFM and demonstrate a resilient version of MiniFE that achieves scalable recovery from process failures.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014 |
| Publisher | Association for Computing Machinery |
| Pages | 51-56 |
| Number of pages | 6 |
| ISBN (Electronic) | 9781450328753 |
| State | Published - Sep 9 2014 |
| Externally published | Yes |
| Event | 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014 - Kyoto, Japan; Duration: Sep 9 2014 → Sep 12 2014 |
Publication series
| Name | ACM International Conference Proceeding Series |
|---|---|
| Volume | 09-12-September-2014 |
Conference
| Conference | 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014 |
|---|---|
| Country/Territory | Japan |
| City | Kyoto |
| Period | 09/09/14 → 09/12/14 |
Funding
The authors would like to thank George Bosilca, Robert Clay, Pedro Diniz and Mark Hoemmen for interesting discussions related to this work. This work was supported by the U.S. Department of Energy (DOE) National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Keywords
- Fault Tolerance
- MPI
- PDE solvers
- Scientific Computing
- User Level Fault Mitigation