
Mini-Ckpts: Surviving OS failures in persistent memory

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

4 Scopus citations

Abstract

Concern is growing in the high-performance computing (HPC) community about the reliability of future extreme-scale systems. Current efforts have focused on application fault-tolerance rather than the operating system (OS), despite the fact that recent studies have suggested that failures in OS memory may be more likely. The OS is critical to the correct and efficient operation of the node and processes it governs, and the parallel nature of HPC applications means any single node failure generally forces all processes of the application to terminate due to tight communication in HPC. Therefore, the OS itself must be capable of tolerating failures in a robust system. In this work, we introduce mini-ckpts, a framework which enables application survival despite the occurrence of a fatal OS failure or crash. mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution, effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three and six seconds and has a failure-free overhead of 3-5% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime systems can continue in the presence of faults. This is a much finer-grained and dynamic method of fault-tolerance than the current coarse-grained application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional faults.

Original language: English
Title of host publication: Proceedings of the 2016 International Conference on Supercomputing, ICS 2016
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450343619
DOIs
State: Published - Jun 1 2016
Event: 30th International Conference on Supercomputing, ICS 2016 - Istanbul, Turkey
Duration: Jun 1 2016 to Jun 3 2016

Publication series

Name: Proceedings of the International Conference on Supercomputing
Volume: 01-03-June-2016

Conference

Conference: 30th International Conference on Supercomputing, ICS 2016
Country/Territory: Turkey
City: Istanbul
Period: 06/1/16 to 06/3/16

Funding

This work was supported in part by a subcontract from Sandia National Laboratories and by NSF grants CNS-1058779 and CNS-0958311.
