Check-pointing approach for fault tolerance in OpenSHMEM

Pengfei Hao, Swaroop Pophale, Pavel Shamis, Tony Curtis, Barbara Chapman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Fault tolerance for long-running applications is critical to guard against failure of either compute resources or the network. Accomplishing this in software is non-trivial, and a one-sided communication library like OpenSHMEM adds a further level of complexity, since there is no matching communication call at the target processing element (PE). In this paper we explore a fault tolerance scheme based on check-point and restart that caters to the one-sided nature of the PGAS programming model while leveraging features specific to OpenSHMEM. Through a working implementation with a 1-D Jacobi code, we show that the approach is scalable and provides considerable savings in computational resources.

Original language: English
Title of host publication: OpenSHMEM and Related Technologies
Subtitle of host publication: Experiences, Implementations, and Technologies - 2nd Workshop, OpenSHMEM 2015, Revised Selected Papers
Editors: Manjunath Gorentla Venkata, Pavel Shamis, Neena Imam, M. Graham Lopez
Publisher: Springer Verlag
Pages: 36-52
Number of pages: 17
ISBN (Print): 9783319264271
State: Published - 2015
Externally published: Yes
Event: 2nd Workshop on OpenSHMEM and Related Technologies, OpenSHMEM 2015 - Annapolis, United States
Duration: Aug 4 2015 to Aug 6 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9397
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 2nd Workshop on OpenSHMEM and Related Technologies, OpenSHMEM 2015
Country/Territory: United States
City: Annapolis
Period: 08/4/15 to 08/6/15

Funding

This work is supported by the United States Department of Defense and used resources of the Extreme Scale Systems Center located at the Oak Ridge National Laboratory.

Funders:
U.S. Department of Defense
Oak Ridge National Laboratory
