Abstract
Concern is growing in the high-performance computing (HPC) community about the reliability of future extreme-scale systems. Current efforts have focused on application fault tolerance rather than the operating system (OS), even though recent studies suggest that failures in OS memory may be more likely. The OS is critical to the correct and efficient operation of the node and the processes it governs, and the parallel nature of HPC applications means that any single node failure generally forces all processes of the application to terminate because of tight communication. In a robust system, therefore, the OS itself must be capable of tolerating failures. In this work, we introduce mini-ckpts, a framework that enables an application to survive a fatal OS failure or crash. mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution, effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three and six seconds and has a failure-free overhead of between 3% and 5% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime systems can continue in the presence of faults. This is a much finer-grained and more dynamic method of fault tolerance than the current coarse-grained, application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional faults.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 2016 International Conference on Supercomputing, ICS 2016 |
| Publisher | Association for Computing Machinery |
| ISBN (Electronic) | 9781450343619 |
| State | Published - Jun 1 2016 |
| Event | 30th International Conference on Supercomputing, ICS 2016 - Istanbul, Turkey; Duration: Jun 1 2016 → Jun 3 2016 |
Publication series
| Name | Proceedings of the International Conference on Supercomputing |
|---|---|
| Volume | 01-03-June-2016 |
Conference
| Conference | 30th International Conference on Supercomputing, ICS 2016 |
|---|---|
| Country/Territory | Turkey |
| City | Istanbul |
| Period | 06/1/16 → 06/3/16 |
Funding
This work was supported in part by a subcontract from Sandia National Laboratories and by NSF grants CNS-1058779 and CNS-0958311.
Title
Mini-Ckpts: Surviving OS failures in persistent memory