Abstract
Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model, which orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration, along with the additional benefit of Burst Buffers (BBs). The model reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer show a reduction of ∼20% to 86% in application overhead, attributable to BBs, orchestrated failure prediction, and migration. We also observe a ∼29% decrease in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.
Original language | English |
---|---|
Title of host publication | HPDC 2020 - Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing |
Publisher | Association for Computing Machinery, Inc |
Pages | 167-171 |
Number of pages | 5 |
ISBN (Electronic) | 9781450370523 |
State | Published - Jun 23 2020 |
Event | 29th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2020 - Stockholm, Sweden. Duration: Jun 23 2020 → Jun 26 2020 |
Publication series
Name | HPDC 2020 - Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing |
---|---|
Conference
Conference | 29th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2020 |
---|---|
Country/Territory | Sweden |
City | Stockholm |
Period | 06/23/20 → 06/26/20 |
Funding
This research was supported in part by NSF grants 1525609 and 1813004, and by an appointment to the Oak Ridge National Laboratory ASTRO Program, sponsored by the U.S. Department of Energy and administered by the Oak Ridge Institute for Science and Education. This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Keywords
- I/O subsystem
- burst buffers
- checkpoint/restart
- failure prediction
- high performance computing
- proactive live migration