Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems

Lipeng Wan, Qing Cao, Feiyi Wang, Sarp Oral

Research output: Contribution to journal › Article › peer-review

23 Scopus citations

Abstract

Non-volatile devices, such as SSDs, will be an integral part of the deepening storage hierarchy on large-scale HPC systems. These devices can be on the compute nodes as part of a distributed burst buffer service or they can be external. Wherever they are located in the hierarchy, one critical design issue is SSD endurance under write-heavy workloads, such as the checkpoint I/O of scientific applications. For these environments, it is widely assumed that checkpoint operations can occur once every 60 minutes and that each checkpoint step can write out as much as half of the system memory. Unfortunately, for large-scale HPC applications, the burst buffer SSDs can wear out much more quickly given the extensive amount of data written at every checkpoint step. One possible solution is to control the amount of data written by reducing the checkpoint frequency. However, a direct effect of reduced checkpoint frequency is a longer window of vulnerability to system failures and therefore potentially wasted computation time, especially for large-scale compute jobs. In this paper, we propose a new checkpoint placement optimization model which collaboratively utilizes both the burst buffer and the parallel file system to store the checkpoints, with the design goals of maximizing computation efficiency while guaranteeing the SSD endurance requirements. Moreover, we present an adaptive algorithm which can dynamically adjust the checkpoint placement based on the system's runtime characteristics and continuously optimize the burst buffer utilization. The evaluation results show that by using our adaptive checkpoint placement algorithm we can guarantee the burst buffer endurance with at most 5% performance degradation per application and less than 3% for the entire system.
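The endurance concern in the abstract can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only and is not the paper's optimization model or adaptive algorithm: the function names, the node memory size, the SSD capacity, and the rated drive-writes-per-day (DWPD) figure are all hypothetical assumptions. Only the 60-minute interval and the "half of memory per checkpoint" figure come from the abstract.

```python
def daily_checkpoint_gb(memory_gb, checkpoint_fraction, interval_min):
    """GB written per day if every checkpoint lands on the node-local SSD."""
    checkpoints_per_day = 24 * 60 / interval_min
    return memory_gb * checkpoint_fraction * checkpoints_per_day


def drive_writes_per_day(daily_gb, ssd_capacity_gb):
    """How many times the whole SSD is overwritten per day (DWPD)."""
    return daily_gb / ssd_capacity_gb


def max_burst_buffer_share(daily_gb, ssd_capacity_gb, rated_dwpd):
    """Largest fraction of checkpoint data the SSD can absorb within its
    rated DWPD; the remainder would have to go to the parallel file system."""
    budget_gb = ssd_capacity_gb * rated_dwpd
    return min(1.0, budget_gb / daily_gb)


# Hypothetical node: 512 GB of memory, a 1024 GB SSD rated for 3 DWPD.
# From the abstract: one checkpoint every 60 minutes, half of memory each.
daily_gb = daily_checkpoint_gb(memory_gb=512, checkpoint_fraction=0.5, interval_min=60)
dwpd = drive_writes_per_day(daily_gb, ssd_capacity_gb=1024)
share = max_burst_buffer_share(daily_gb, ssd_capacity_gb=1024, rated_dwpd=3)

print(f"checkpoint volume: {daily_gb:.0f} GB/day")
print(f"required DWPD if all checkpoints go to SSD: {dwpd:.1f}")
print(f"share of checkpoint data the SSD can absorb: {share:.0%}")
```

Under these assumed numbers the node writes 6144 GB of checkpoint data per day, twice what the SSD is rated for, so only half of it could be placed on the burst buffer. A static cap like this is the simplest possible policy; the paper instead describes a placement that adapts to runtime characteristics.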

Original language: English
Pages (from-to): 16-29
Number of pages: 14
Journal: Journal of Parallel and Distributed Computing
Volume: 100
State: Published - Feb 1 2017

Funding

We would like to thank the reviewers for their insightful and inspiring comments. This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725. This work was also supported by NSF grant 0953238.

Keywords

  • Burst buffer
  • Checkpoint
  • Fault tolerance
  • Hierarchical storage systems
  • Solid-state drive
