Abstract
In this paper, we address the design challenge of building multiresilient iterative high-performance computing (HPC) applications. Multiresilience in HPC applications is the ability to tolerate both soft errors and process failures while maintaining forward progress. We address the challenge by proposing performance models that help design performance-efficient and resilient iterative applications. The models consider the interaction between soft error and process failure resilience solutions. We experimented with a linear solver application and two distinct kinds of soft error detectors: one with high overhead and high accuracy, the other with low overhead and low accuracy. We show how both can be leveraged to verify the integrity of the checkpointed state used to recover from both soft errors and process failures. Our results show the performance efficiency and resiliency benefit of invoking the low-overhead detector at high frequency within the checkpoint interval, so that soft error recovery takes place promptly and less work is re-computed.
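The closing claim of the abstract, that frequent low-overhead detection reduces re-computed work, can be illustrated with a toy calculation. The sketch below is not the paper's performance model: the function name, its parameters, and every modeling assumption (uniformly random strike time, rollback to the start of the interval, a perfectly accurate check at the end of the interval, no detector overhead) are hypothetical simplifications chosen for illustration.

```python
def expected_rework(interval, detect_every, recall):
    """Expected iterations re-computed after one soft error in a checkpoint interval.

    Toy assumptions (not taken from the paper):
    - the error strikes at a uniformly random iteration of the interval;
    - recovery rolls back to the checkpoint at the start of the interval;
    - each in-interval detector call catches the error with probability `recall`;
    - a perfectly accurate check at the end of the interval catches the rest;
    - detector overhead is ignored.
    """
    total = 0.0
    for strike in range(interval):
        missed = 1.0  # probability the error is still undetected
        # First in-interval detector call that runs after the strike.
        first = (strike // detect_every + 1) * detect_every
        for check in range(first, interval, detect_every):
            total += missed * recall * check  # detected here: redo `check` iterations
            missed *= 1.0 - recall
        total += missed * interval  # caught only by the end-of-interval check
    return total / interval


if __name__ == "__main__":
    # Detector only at the checkpoint vs. every 10 iterations (perfect and lossy).
    print(expected_rework(100, 100, 1.0))
    print(expected_rework(100, 10, 1.0))
    print(expected_rework(100, 10, 0.5))
```

Under these assumptions, checking only at the end of a 100-iteration interval costs 100 re-computed iterations per error, while a perfect detector invoked every 10 iterations lowers the expectation to 55. A detector with lower recall falls between the two. A fuller model would also charge each detector call its overhead, which is the trade-off the abstract describes.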
Original language | English |
---|---|
Title of host publication | Euro-Par 2018 |
Subtitle of host publication | Parallel Processing Workshops - Euro-Par 2018 International Workshops, Revised Selected Papers |
Editors | Gabriele Mencagli, Dora B. Heras |
Publisher | Springer Verlag |
Pages | 813-825 |
Number of pages | 13 |
ISBN (Print) | 9783030105488 |
State | Published - 2019 |
Event | 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018, Turin, Italy. Duration: Aug 27 2018 → Aug 28 2018 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 11339 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018 |
---|---|
Country/Territory | Italy |
City | Turin |
Period | 08/27/18 → 08/28/18 |
Funding
This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Acknowledgements. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC05-00OR22725.
Keywords
- Analytical models
- Checkpoint restart
- Design patterns
- Fault injection
- High-performance computing
- Iterative algorithms
- Linear solver
- Performance
- Process failures
- Resilience
- Soft errors