Abstract
Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper we suggest going one step further, and explore how local recovery can be used for certain classes of applications to reduce the overheads due to failures. Specifically, we study the feasibility of local recovery for stencil-based parallel applications and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.
| Original language | English |
|---|---|
| Title of host publication | HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 279-282 |
| Number of pages | 4 |
| ISBN (Electronic) | 9781450335508 |
| DOIs | |
| State | Published - Jun 15 2015 |
| Externally published | Yes |
| Event | 24th ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015 - Portland, United States Duration: Jun 15 2015 → Jun 19 2015 |
Publication series
| Name | HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing |
|---|---|
Conference
| Conference | 24th ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015 |
|---|---|
| Country/Territory | United States |
| City | Portland |
| Period | 06/15/15 → 06/19/15 |
Funding
The authors would like to thank Josep Gamell, Robert Clay and George Bosilca for interesting discussions related to this work. The research presented in this work is supported in part by the National Science Foundation (NSF) via grant numbers ACI 1339036, ACI 1310283, CNS 1305375, and DMS 1228203, by the Director, Office of Advanced Scientific Computing Research, Office of Science, of the US Department of Energy Scientific Discovery through Advanced Computing (SciDAC) Institute of Scalable Data Management, Analysis and Visualization (SDAV) under award number DE-SC0007455, the Advanced Scientific Computing Research and Fusion Energy Sciences Partnership for Edge Physics Simulations (EPSI) under award number DE-FG02-06ER54857, the ExaCT Combustion Co-Design Center via subcontract number 4000110839 from UT Battelle, and by an IBM Faculty Award. The research at Rutgers was conducted as part of the Rutgers Discovery Informatics Institute (RDI2). Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.