Optimal cooperative checkpointing for shared high-performance computing platforms

Thomas Herault, Yves Robert, Aurelien Bouteiller, Dorian Arnold, Kurt Ferreira, George Bosilca, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

5 Scopus citations

Abstract

In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. In addition to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden, as it increases I/O contention and leads to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications that share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints for minimizing the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement in overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
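
For readers unfamiliar with the Young/Daly interval mentioned in the abstract, the following is a minimal sketch of the classical first-order formula for a single application, not the cooperative model developed in the paper. The names `checkpoint_cost` (time to write one checkpoint, often written C) and `mtbf` (mean time between failures, often written μ) are illustrative assumptions, and the waste expression is the usual first-order approximation that ignores downtime and recovery.

```python
import math


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order Young/Daly checkpointing period: sqrt(2 * C * mu).

    Both arguments must use the same time unit (e.g. seconds).
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost for a single application.

    C / T is the checkpointing overhead; T / (2 * mu) is the expected
    re-execution after a failure (assumption: downtime and recovery ignored).
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    c, mu = 50.0, 10_000.0
    t_opt = young_daly_period(c, mu)
    print(t_opt, first_order_waste(t_opt, c, mu))
```

In this single-application view, the checkpoint cost is treated as a fixed constant. When several applications checkpoint at the same time and share I/O bandwidth, the effective cost of each checkpoint grows, which is one way to see why the abstract argues that the per-application interval alone does not settle the platform-level problem.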

Original language: English
Title of host publication: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 803-812
Number of pages: 10
ISBN (Print): 9781538655559
State: Published - Aug 3 2018
Externally published: Yes
Event: 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018 - Vancouver, Canada
Duration: May 21 2018 to May 25 2018

Publication series

Name: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018

Conference

Conference: 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
Country/Territory: Canada
City: Vancouver
Period: 05/21/18 to 05/25/18

Keywords

  • Checkpoint
  • Cooperative strategy
  • I/O contention
  • Workflow
