Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach

  • Matthew Whitlock
  • Nicolas Morales
  • George Bosilca
  • Aurelien Bouteiller
  • Bogdan Nicolae
  • Keita Teranishi
  • Elisabeth Giem
  • Vivek Sarkar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

Integrating recent advancements in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance - in part due to the underlying complexity of implementing resilience in the first place, but also due to the difficulty introduced when integrating the functionality of a standalone new strategy with the preexisting resilience layers of an application. We propose that the answer is not to build integrated solutions for users, but runtimes designed to integrate into a larger comprehensive resilience system and thereby enable the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Utilizing Fenix, a process resilience tool with integration into preexisting resilience systems as a design priority, we update Kokkos Resilience and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and - importantly - simplicity.

Original language: English
Title of host publication: Proceedings - 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 418-428
Number of pages: 11
ISBN (Electronic): 9781665498562
State: Published - 2022
Externally published: Yes
Event: 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022 - Heidelberg, Germany
Duration: Sep 6 2022 to Sep 9 2022

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2022-September
ISSN (Print): 1552-5244

Conference

Conference: 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022
Country/Territory: Germany
City: Heidelberg
Period: 09/6/22 to 09/9/22

Funding

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (NNSA) under contract DE-NA0003525.

Keywords

  • Checkpointing
  • Fault Tolerance
  • Fenix
  • HPC
  • Kokkos
  • MPI-ULFM
  • Resilience
