Asynchrony and Failure Masking via Pseudo-Local Process Recovery in MPI Applications

  • Matthew Whitlock
  • Hemanth Kolla
  • Aurelien Bouteiller
  • Jackson R. Mayo
  • Nicolas M. Morales
  • Keita Teranishi
  • George Bosilca

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

For parallel solvers susceptible to hardware-related failures, localizing recovery to the processes directly affected by a failure preserves asynchronous progress and exhibits 'failure masking' due to the limited propagation of recovery delays. This improves scalability compared to global recovery, which is a disproportionate response. However, localizing recovery from hard failures is challenging because such failures are not transparent to the MPI runtime, requiring reconstruction of the communication layers and of a consistent application state. In this work, we present the process- and data-recovery concepts that enable the performance and scalability of localized recovery despite the inherently non-local nature of some recovery steps. We present design enhancements to existing resilience middleware (the Fenix library and MPI User-Level Failure Mitigation) to robustly support larger-scale execution and 'pseudo-local' checkpointing and recovery from many process failures. Using an example stencil solver with emulated hard failures, we present an experimental evaluation, with runs on up to 1000 ranks subject to 100 process failures, which confirms that pseudo-local recovery has significantly improved weak scaling compared to the roughly exponential slowdown of global recovery. Our work shows how fault tolerance infrastructure originally designed for global checkpoint/restart can be repurposed to enable greater efficiency in a resilience-aware application.
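
The contrast between local and global recovery delays can be illustrated with a toy timing model. The sketch below is not the paper's stencil solver and uses neither Fenix nor ULFM. It assumes a periodic 1D stencil in which each iteration costs one time unit and a rank may start an iteration only after it and its two neighbours have finished the previous one. The function name `makespan`, its parameters, and the choice to model global recovery as a synchronizing stall charged once per failure are all hypothetical simplifications.

```python
def makespan(n_ranks, n_iters, failures, recovery_delay, local):
    """Toy timing model of a periodic 1D stencil (illustrative only).

    failures: set of (iteration, rank) pairs at which a rank fails.
    local=True : only the failed rank pays recovery_delay; neighbours
                 are slowed indirectly through their data dependencies.
    local=False: every rank waits for the slowest one, then all ranks
                 pay recovery_delay once per failure (assumed model of
                 a global rollback).
    """
    done = [0.0] * n_ranks  # time at which each rank finished its last iteration
    for t in range(n_iters):
        failed = [r for r in range(n_ranks) if (t, r) in failures]
        if failed and not local:
            barrier = max(done) + recovery_delay * len(failed)
            done = [barrier] * n_ranks
        new_done = []
        for r in range(n_ranks):
            # Wait for the left neighbour, this rank, and the right neighbour.
            start = max(done[r - 1], done[r], done[(r + 1) % n_ranks])
            cost = 1.0
            if local and (t, r) in failures:
                cost += recovery_delay
            new_done.append(start + cost)
        done = new_done
    return max(done)


if __name__ == "__main__":
    failures = {(2, 1), (2, 5)}  # two ranks fail during the same iteration
    print("no failures:", makespan(8, 20, set(), 10.0, local=True))
    print("local      :", makespan(8, 20, failures, 10.0, local=True))
    print("global     :", makespan(8, 20, failures, 10.0, local=False))
```

Under these assumptions the failure-free run takes 20 time units, the local model takes 30, and the global model takes 40: the two local recovery delays overlap in time instead of adding up, which is one simplified way to picture failure masking. The numbers say nothing about the measured results in the paper.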

Original language: English
Title of host publication: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1164-1166
Number of pages: 3
ISBN (Electronic): 9798350364606
DOIs
State: Published - 2024
Event: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2024 - San Francisco, United States
Duration: May 27 2024 to May 31 2024

Publication series

Name: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2024

Conference

Conference: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2024
Country/Territory: United States
City: San Francisco
Period: 05/27/24 to 05/31/24

Funding

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration (NNSA) under contract DE-NA0003525.

Keywords

  • Failure Masking
  • Fault Tolerance
  • Local Recovery
