Orchestrating Fault Prediction with Live Migration and Checkpointing

Subhendu Behera, Lipeng Wan, Frank Mueller, Matthew Wolf, Scott Klasky

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

8 Scopus citations

Abstract

Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model, which orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration along with the additional benefit of Burst Buffers (BB). It effectively reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer yield a reduction of ∼20%-86% in application overhead due to BBs, orchestrated failure prediction, and migration. We also observe a ∼29% decrease in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.

Original language: English
Title of host publication: HPDC 2020 - Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery, Inc
Pages: 167-171
Number of pages: 5
ISBN (Electronic): 9781450370523
State: Published - Jun 23 2020
Event: 29th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2020 - Stockholm, Sweden
Duration: Jun 23 2020 to Jun 26 2020

Publication series

Name: HPDC 2020 - Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing

Conference

Conference: 29th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2020
Country/Territory: Sweden
City: Stockholm
Period: 06/23/20 to 06/26/20

Funding

This research was supported in part by NSF grants 1525609, 1813004, and an appointment to the Oak Ridge National Laboratory ASTRO Program, sponsored by the U.S. Department of Energy and administered by the Oak Ridge Institute for Science and Education. This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Funders: Funder number
U.S. Department of Energy Office of Science
National Science Foundation: 1813004, 1525609
U.S. Department of Energy: DE-AC05-00OR22725
Office of Science
National Nuclear Security Administration
Oak Ridge Institute for Science and Education: 17-SC-20-SC

Keywords

• I/O subsystem
• burst buffers
• checkpoint/restart
• failure prediction
• high performance computing
• proactive live migration
