Abstract
As fault recovery mechanisms become increasingly important in HPC systems, the need for a new recovery model for workflows on these systems grows as well. While the traditional approach, in which each system component attempts its own independent recovery after a fault, works well at the level of an individual application, it does not scale to workflow-level exception handling. Because today's workflows must often run many components simultaneously (e.g., workflow manager components, many simulation instances, and data analytics), an uncoordinated model can quickly result in redundant or contradictory recovery actions. In this paper, we propose a multi-level cooperative exception model (MCEM), a novel exception handling approach that solves this coordination challenge for HPC workflows. We present our model, describe how it can be applied to common system faults and other workflow-specific exceptions, and demonstrate how it reduces redundant I/O in the case of a file-system quota exception.
| Original language | English |
|---|---|
| Title of host publication | ROSS 2019 - Proceedings of the 9th International Workshop on Runtime and Operating Systems for Supercomputers, co-located with HPDC 2019 |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 27-32 |
| Number of pages | 6 |
| ISBN (Electronic) | 9781450367554 |
| State | Published - Jun 17 2019 |
| Event | 9th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2019, co-located with HPDC 2019 - Phoenix, United States. Duration: Jun 25 2019 → … |
Publication series
| Name | ROSS 2019 - Proceedings of the 9th International Workshop on Runtime and Operating Systems for Supercomputers, co-located with HPDC 2019 |
|---|---|
Conference
| Conference | 9th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2019, co-located with HPDC 2019 |
|---|---|
| Country/Territory | United States |
| City | Phoenix |
| Period | 06/25/19 → … |
Funding
This article has been authored by Lawrence Livermore National Security, LLC under Contract No. DE-AC52-07NA27344 with the U.S. Department of Energy. Accordingly, the United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this article or allow others to do so, for United States Government purposes (LLNL-CONF-771601).
Keywords
- Coordinated recovery
- Fault-tolerance
- Workflow exception model