Horseshoes and hand grenades: The case for approximate coordination in local checkpointing protocols

Patrick M. Widener, Kurt B. Ferreira, Scott Levy

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

2 Scopus citations

Abstract

Fault-tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. While fully uncoordinated approaches have been shown to have significant delays, the degree of synchronization required to keep overheads low has not yet been significantly addressed. In this paper, we use a simulation-based approach to show the impact of synchronization on local checkpoint activity. Specifically, we show the degree of synchronization needed to keep the impacts of local checkpointing low is attainable with current technology for a number of key production HPC workloads. Our work provides a critical analysis and comparison of synchronization and local checkpointing. This enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics available.

Original language: English
Title of host publication: Euro-Par 2016
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2016 International Workshops, Revised Selected Papers
Editors: Pierre-Francois Dutot, Frederic Desprez
Publisher: Springer Verlag
Pages: 623-634
Number of pages: 12
ISBN (Print): 9783319589428
State: Published - 2017
Externally published: Yes
Event: 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016 - Grenoble, France
Duration: Aug 24 2016 to Aug 26 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10104 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016
Country/Territory: France
City: Grenoble
Period: 08/24/16 to 08/26/16

Funding

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly-owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND2016-5027C.

Funders (with funder numbers where listed):
U.S. Department of Energy's National Nuclear Security Administration: SAND2016-5027C, DE-AC04-94AL85000
Lockheed Martin Corporation
