Abstract
In this position paper, we argue for improved fault-tolerance of an MPI code by introducing lightweight virtualization into the MPI interface. In particular, we outline key-value store semantics for MPI send/recv calls, thereby creating a far more expressive programming model. The general message passing semantics and imperative style of MPI application codes would remain essentially unchanged. However, the additional expressibility of the programming model 1) enables the underlying transport layer to handle fault-tolerance more transparently to the application developer, and 2) provides an evolutionary code path towards more declarative asynchronous programming models. The core contribution of this paper is an initial implementation of the DHARMA transport layer that provides the new, required functionality to support the MPI key-value store model.
| Original language | English |
|---|---|
| Title of host publication | FTXS 2015 - Proceedings of the 2015 Workshop on Fault Tolerance for HPC at eXtreme Scale, Part of HPDC 2015 |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 41-46 |
| Number of pages | 6 |
| ISBN (Electronic) | 9781450335690 |
| State | Published - Jun 15 2015 |
| Externally published | Yes |
| Event | 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2015 - Portland, United States. Duration: Jun 15 2015 → … |
Publication series
| Name | FTXS 2015 - Proceedings of the 2015 Workshop on Fault Tolerance for HPC at eXtreme Scale, Part of HPDC 2015 |
|---|---|
Conference
| Conference | 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2015 |
|---|---|
| Country/Territory | United States |
| City | Portland |
| Period | 06/15/15 → … |
Funding
The authors would like to thank Craig Ulmer, Gary Templet, and Abhinav Vishnu for useful discussions. This work was supported by the U.S. Department of Energy (DOE) National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.