Abstract
Fault tolerance for long-running applications is critical to guard against failure of either compute resources or the network. Accomplishing this in software is non-trivial, and implementing a working model for a one-sided communication library such as OpenSHMEM adds a further level of complexity, since there is no matching communication call at the target processing element (PE). In this paper we explore a fault tolerance scheme based on checkpoint and restart that caters to the one-sided nature of the PGAS programming model while leveraging features specific to OpenSHMEM. Through a working implementation with the 1-D Jacobi code, we show that the approach is scalable and provides considerable savings in computational resources.
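To illustrate the general idea of checkpoint and restart on a 1-D Jacobi iteration, the following is a minimal single-process sketch. It is not the scheme described in the paper: it uses no OpenSHMEM calls, keeps the checkpoint in memory, and simulates a failure by rolling back to the last saved state. The names `jacobi_step`, `run`, `ckpt_every`, and `fail_at` are hypothetical and chosen only for this example.

```python
def jacobi_step(u):
    """One 1-D Jacobi sweep: each interior point becomes the mean of its neighbours."""
    new = u[:]  # boundary values stay fixed
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def run(u0, iters, ckpt_every, fail_at=None):
    """Run `iters` sweeps, checkpointing every `ckpt_every` iterations.

    If `fail_at` is given, a single failure is simulated just before that
    iteration: the state is discarded and restored from the last checkpoint.
    Returns the final state and the number of iterations that had to be redone.
    """
    checkpoint = (0, list(u0))  # (iteration number, saved state)
    it, u = 0, list(u0)
    failed = False
    redone = 0
    while it < iters:
        if fail_at is not None and not failed and it == fail_at:
            failed = True
            redone = it - checkpoint[0]  # work lost since the last checkpoint
            it, u = checkpoint[0], list(checkpoint[1])  # restart from checkpoint
            continue
        u = jacobi_step(u)
        it += 1
        if it % ckpt_every == 0:
            checkpoint = (it, list(u))  # save a copy of the current state
    return u, redone
```

In this sketch, a run that fails at iteration 37 with a checkpoint interval of 10 repeats only the 7 iterations since iteration 30 rather than starting over, and ends in the same state as an uninterrupted run. In a real one-sided setting, taking a consistent checkpoint also requires that outstanding remote writes have completed, which this serial example does not model.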
Original language | English |
---|---|
Title of host publication | OpenSHMEM and Related Technologies |
Subtitle of host publication | Experiences, Implementations, and Technologies - 2nd Workshop, OpenSHMEM 2015, Revised Selected Papers |
Editors | Manjunath Gorentla Venkata, Pavel Shamis, Neena Imam, M. Graham Lopez |
Publisher | Springer Verlag |
Pages | 36-52 |
Number of pages | 17 |
ISBN (Print) | 9783319264271 |
State | Published - 2015 |
Externally published | Yes |
Event | 2nd Workshop on OpenSHMEM and Related Technologies, OpenSHMEM 2015 - Annapolis, United States; Duration: Aug 4 2015 → Aug 6 2015 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 9397 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 2nd Workshop on OpenSHMEM and Related Technologies, OpenSHMEM 2015 |
---|---|
Country/Territory | United States |
City | Annapolis |
Period | 08/4/15 → 08/6/15 |
Funding
This work is supported by the United States Department of Defense and used resources of the Extreme Scale Systems Center located at the Oak Ridge National Laboratory.