Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System

Sri Raj Paul, Akihiro Hayashi, Matthew Whitlock, Seonmyeong Bak, Keita Teranishi, Jackson Mayo, Max Grossman, Vivek Sarkar

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

2 Scopus citations

Abstract

Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience techniques has typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resilience, since they provide explicit abstractions of data and tasks that contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single-node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resilience across multiple nodes. We also enable resilience against fail-stop errors, where our runtime manages all re-execution of tasks and communication without user intervention. Our results show that we are able to add communication operations to resilient programs with low overhead by offloading communication to dedicated communication workers, and to recover from fail-stop errors transparently, thereby enhancing productivity.

Original language: English
Title of host publication: Proceedings of ExaMPI 2020
Subtitle of host publication: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 41-51
Number of pages: 11
ISBN (Electronic): 9781665415613
State: Published - Nov 2020
Externally published: Yes
Event: 2020 Exascale MPI Workshop, ExaMPI 2020 - Virtual, Atlanta, United States
Duration: Nov 13 2020 → …

Publication series

Name: Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2020 Exascale MPI Workshop, ExaMPI 2020
Country/Territory: United States
City: Virtual, Atlanta
Period: 11/13/20 → …

Funding

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (NNSA) under contract DE-NA0003525. This work was funded by NNSA’s Advanced Simulation and Computing (ASC) Program. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.

Keywords

  • AMT Runtimes
  • Fenix
  • Habanero C/C++
  • MPI communication
  • MPI-ULFM
  • Resilience
