TY - GEN
T1 - Orchestrating Fault Prediction with Live Migration and Checkpointing
AU - Behera, Subhendu
AU - Wan, Lipeng
AU - Mueller, Frank
AU - Wolf, Matthew
AU - Klasky, Scott
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/6/23
Y1 - 2020/6/23
N2 - Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model, which orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration along with the additional benefit of Burst Buffers (BB). It effectively reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer yield a reduction of ∼20%-86% in application overhead due to BBs, orchestrated failure prediction, and migration. We also observe a ∼29% decrease in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.
KW - I/O subsystem
KW - burst buffers
KW - checkpoint/restart
KW - failure prediction
KW - high performance computing
KW - proactive live migration
UR - http://www.scopus.com/inward/record.url?scp=85088359263&partnerID=8YFLogxK
U2 - 10.1145/3369583.3392672
DO - 10.1145/3369583.3392672
M3 - Conference contribution
AN - SCOPUS:85088359263
T3 - HPDC 2020 - Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing
SP - 167
EP - 171
BT - HPDC 2020 - Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
T2 - 29th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2020
Y2 - 23 June 2020 through 26 June 2020
ER -