Abstract
Fault tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. While fully uncoordinated approaches have been shown to incur significant delays, the degree of synchronization required to keep overheads low has not yet been thoroughly addressed. In this paper, we use a simulation-based approach to show the impact of synchronization on local checkpoint activity. Specifically, we show that the degree of synchronization needed to keep the impact of local checkpointing low is attainable with current technology for a number of key production HPC workloads. Our work provides a critical analysis and comparison of synchronization and local checkpointing. This enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics available.
Original language | English |
---|---|
Title of host publication | Euro-Par 2016 |
Subtitle of host publication | Parallel Processing Workshops - Euro-Par 2016 International Workshops, Revised Selected Papers |
Editors | Pierre-Francois Dutot, Frederic Desprez |
Publisher | Springer Verlag |
Pages | 623-634 |
Number of pages | 12 |
ISBN (Print) | 9783319589428 |
State | Published - 2017 |
Externally published | Yes |
Event | 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016 - Grenoble, France; Duration: Aug 24 2016 → Aug 26 2016 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 10104 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016 |
---|---|
Country/Territory | France |
City | Grenoble |
Period | 08/24/16 → 08/26/16 |
Funding
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly-owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND2016-5027C.