Modeling the impact of checkpoints on next-generation systems

Ron A. Oldfield, Patricia J. Teller, Maria Ruiz Varela, Sarala Arunagiri, Seetharami Seelam, Rolf Riesen, Philip C. Roth

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

83 Scopus citations

Abstract

The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands of processors. For application-driven, periodic checkpoint operations, the state of the art does not provide a solution that scales to next-generation systems. We demonstrate this by using mathematical modeling to compute a lower bound on the impact of these approaches on the performance of applications executed on three massive-scale, in-production DOE systems and a theoretical petaflop system. We also adapt the model to investigate a proposed optimization that makes use of "lightweight" storage architectures and overlay networks to overcome the storage system bottleneck. Our results indicate that (1) as we approach the scale of next-generation systems, traditional checkpoint/restart approaches will increasingly impact application performance, accounting for over 50% of total application execution time; (2) although our alternative approach improves performance, it has limitations of its own; and (3) there is a critical need for new approaches to fault tolerance that allow continuous computing with minimal impact on application scalability.

Original language: English
Title of host publication: Proceedings - 24th IEEE Conference on Mass Storage Systems and Technologies, MSST 2007
Pages: 30-43
Number of pages: 14
State: Published - 2007
Event: 24th IEEE Conference on Mass Storage Systems and Technologies, MSST 2007 - San Diego, CA, United States
Duration: Sep 24 2007 → Sep 27 2007

Publication series

Name: Proceedings - 24th IEEE Conference on Mass Storage Systems and Technologies, MSST 2007

Conference

Conference: 24th IEEE Conference on Mass Storage Systems and Technologies, MSST 2007
Country/Territory: United States
City: San Diego, CA
Period: 09/24/07 → 09/27/07
