Abstract
Integrating recent advancements in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance, in part due to the underlying complexity of implementing resilience in the first place, but also due to the difficulty of integrating the functionality of a standalone new strategy with the preexisting resilience layers of an application. We propose that the answer is not to build integrated solutions for users, but to build runtimes designed to integrate into a larger comprehensive resilience system and thereby enable the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Utilizing Fenix, a process resilience tool with integration into preexisting resilience systems as a design priority, we update Kokkos Resilience and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and, importantly, simplicity.
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 418-428 |
| Number of pages | 11 |
| ISBN (Electronic) | 9781665498562 |
| State | Published - 2022 |
| Externally published | Yes |
| Event | 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022 - Heidelberg, Germany; Duration: Sep 6 2022 → Sep 9 2022 |
Publication series
| Name | Proceedings - IEEE International Conference on Cluster Computing, ICCC |
|---|---|
| Volume | 2022-September |
| ISSN (Print) | 1552-5244 |
Conference
| Conference | 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022 |
|---|---|
| Country/Territory | Germany |
| City | Heidelberg |
| Period | 09/6/22 → 09/9/22 |
Funding
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (NNSA) under contract DE-NA0003525.
Keywords
- Checkpointing
- Fault Tolerance
- Fenix
- HPC
- Kokkos
- MPI-ULFM
- Resilience
Fingerprint
Dive into the research topics of 'Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach'. Together they form a unique fingerprint.