MCEM: Multi-level cooperative exception model for HPC workflows

Stephen Herbein, David Domyancic, Paul Minner, Ignacio Laguna, Rafael Ferreira Da Silva, Dong H. Ahn

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

As fault recovery mechanisms become increasingly important in HPC systems, the need for a new recovery model for workflows on these systems grows as well. While the traditional approach in which each system component attempts its own independent recovery after a fault works well at each individual application level, this model does not scale to the new level demanded by workflow-level exception handling. As today's workflows must often run many components simultaneously (e.g., workflow manager components, many simulation instances, data analytics, etc.), any uncoordinated model can quickly result in redundant or contradictory recovery actions. In this paper, we propose a multi-level cooperative exception model (MCEM), a novel exception handling approach that solves this coordination challenge for HPC workflows. We present our model, describe how it can be applied to common system faults and other workflow-specific exceptions, and demonstrate how it reduces redundant I/O in the case of a file-system quota exception.

Original language: English
Title of host publication: ROSS 2019 - Proceedings of the 9th International Workshop on Runtime and Operating Systems for Supercomputers, co-located with HPDC 2019
Publisher: Association for Computing Machinery, Inc
Pages: 27-32
Number of pages: 6
ISBN (Electronic): 9781450367554
State: Published - Jun 17 2019
Event: 9th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2019, co-located with HPDC 2019 - Phoenix, United States
Duration: Jun 25 2019 → …

Publication series

Name: ROSS 2019 - Proceedings of the 9th International Workshop on Runtime and Operating Systems for Supercomputers, co-located with HPDC 2019

Conference

Conference: 9th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2019, co-located with HPDC 2019
Country/Territory: United States
City: Phoenix
Period: 06/25/19 → …

Funding

This article has been authored by Lawrence Livermore National Security, LLC under Contract No. DE-AC52-07NA27344 with the U.S. Department of Energy. Accordingly, the United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this article or allow others to do so, for United States Government purposes (LLNL-CONF-771601).

Keywords

  • Coordinated recovery
  • Fault-tolerance
  • Workflow exception model
