Fault tolerance for OpenSHMEM

Pengfei Hao, Swaroop Pophale, Pavel Shamis, Aaron Welch, Barbara Chapman, Manjunath Gorentla Venkata, Stephen Poole

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

1 Scopus citation

Abstract

On today's supercomputing systems, faults are becoming the norm rather than the exception. Given the complexity required to achieve the expected scalability and performance on future systems, this situation is expected to worsen, and systems will have to function in the nearly constant presence of faults. To be productive on these systems, programming models will require both hardware and software to be resilient to faults. With the growing importance of the PGAS programming model and OpenSHMEM as part of the HPC software stack, the lack of a fault-tolerance model may become a liability for users. Toward this end, in this paper we discuss the viability of using checkpoint/restart as a fault-tolerance method for OpenSHMEM, propose a selective checkpoint/restart fault-tolerance model, and discuss the challenges associated with implementing the proposed model.

Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Original language: English
Title of host publication: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, PGAS 2014
Editors: Nadine Couture, David Broman, Christian Bastien, Tomas Dorta, Peter Pepper
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450329538, 9781450329705, 9781450330312, 9781450331883, 9781450332477
DOIs
State: Published - Oct 6 2014
Externally published: Yes
Event: 8th International Conference on Partitioned Global Address Space Programming Models, PGAS 2014 - Eugene, United States
Duration: Oct 6 2014 to Oct 10 2014

Publication series

Name: ACM International Conference Proceeding Series
Volume: 2014-October

Conference

Conference: 8th International Conference on Partitioned Global Address Space Programming Models, PGAS 2014
Country/Territory: United States
City: Eugene
Period: 10/6/14 to 10/10/14

Funding

Funders: Oak Ridge National Laboratory
