Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony

Hemanth Kolla, Jackson R. Mayo, Keita Teranishi, Robert C. Armstrong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.

Original language: English
Title of host publication: Proceedings of FTXS 2020
Subtitle of host publication: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-10
Number of pages: 10
ISBN (Electronic): 9781665422895
State: Published - Nov 2020
Externally published: Yes
Event: 10th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2020 - Virtual, Atlanta, United States
Duration: Nov 11 2020 → …

Publication series

Name: Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 10th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2020
Country/Territory: United States
City: Virtual, Atlanta
Period: 11/11/20 → …

Funding

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (NNSA) under contract DE-NA0003525. This work was funded by NNSA’s Advanced Simulation and Computing (ASC) Program. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
