P-ckpt: Coordinated Prioritized Checkpointing

Subhendu Behera, Lipeng Wan, Frank Mueller, Matthew Wolf, Scott Klasky

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citations

Abstract

Good prediction accuracy and adequate lead time to failure are key to the success of failure-aware Checkpoint/Restart (C/R) models on current and future large-scale High-Performance Computing (HPC) systems. This paper develops a novel checkpointing technique, called p-ckpt, that aims to maintain the performance efficiency of failure-aware C/R models even when failures are predicted with a small lead time. The p-ckpt technique is developed for HPC systems with multi-level memory systems to prioritize checkpoints from vulnerable nodes (nodes with predicted failure) in the event of failure prediction. It applies coordination among the nodes within an application so that vulnerable nodes' checkpoint data is stored to the Parallel File System (PFS) first by assigning priorities based on the lead time to failure. Vulnerable nodes thus have low-latency access on the critical path to the PFS before any failure happens. Further, we create the hybrid p-ckpt model by integrating Live Migration (LM) because of its cost-effectiveness and to reduce checkpoint frequency. Our hybrid p-ckpt C/R model considers prediction lead time and checkpoint latency to the PFS to decide on a feasible proactive action such as p-ckpt and LM. Simulations of six real-world applications for the Summit supercomputer show a ~53-65% reduction in overhead due to the hybrid p-ckpt model compared to a ~31-61% reduction in a state-of-the-art solution. We assess our C/R models against multiple failure distributions and consider lead time variability and failure prediction accuracy. Based on this evaluation and assessment, we discuss the trade-offs of using these models and their impact on application overhead.
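
The two ideas in the abstract, ordering PFS writes by lead time to failure and choosing a proactive action from lead time and latency, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Node`, `pfs_write_order`, and `choose_action`, the latency parameters, and the specific rule (prefer LM when the lead time covers it, otherwise fall back to p-ckpt) are all assumptions made for illustration. The abstract states only that lead time and checkpoint latency inform the decision.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    # Predicted seconds until failure; None means no failure is predicted.
    lead_time_s: Optional[float] = None


def pfs_write_order(nodes: List[Node]) -> List[str]:
    """Order checkpoint writes: vulnerable nodes first, shortest lead time first.

    Healthy nodes (no predicted failure) keep their relative order and go last.
    """
    vulnerable = sorted(
        (n for n in nodes if n.lead_time_s is not None),
        key=lambda n: n.lead_time_s,
    )
    healthy = [n for n in nodes if n.lead_time_s is None]
    return [n.name for n in vulnerable + healthy]


def choose_action(lead_time_s: float, lm_latency_s: float,
                  pckpt_latency_s: float) -> str:
    """Hypothetical decision rule (an assumption, not taken from the paper).

    Prefer live migration if the lead time covers it, otherwise a prioritized
    checkpoint if that fits, otherwise no proactive action is feasible.
    """
    if lead_time_s >= lm_latency_s:
        return "live_migration"
    if lead_time_s >= pckpt_latency_s:
        return "p-ckpt"
    return "none"
```

For example, with nodes `n0` (healthy), `n1` (30 s lead time), and `n2` (5 s lead time), `pfs_write_order` places `n2` ahead of `n1`, with `n0` last. The thresholds passed to `choose_action` would have to come from measured migration and checkpoint latencies on the target system.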

Original language: English
Title of host publication: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 436-446
Number of pages: 11
ISBN (Electronic): 9781665481069
State: Published - 2022
Event: 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022 - Virtual, Online, France
Duration: May 30 2022 to Jun 3 2022

Publication series

Name: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Conference

Conference: 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022
Country/Territory: France
City: Virtual, Online
Period: 05/30/22 to 06/3/22

Funding

We would like to thank the reviewers for their valuable feedback. This research was supported in part by NSF grants 1525609, 1813004, and 1818914, the DOE ASCR SIRIUS-2 project, and the Exascale Computing Project (17-SC-20-SC). This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Keywords

  • Burst Buffers
  • Checkpoint/Restart
  • Failure Prediction
  • Fault Tolerance
  • High-Performance Computing
  • I/O subsystem
  • Live Migration
