Asking the right questions: Benchmarking fault-tolerant extreme-scale systems

Patrick M. Widener, Kurt B. Ferreira, Scott Levy, Patrick G. Bridges, Dorian Arnold, Ron Brightwell

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Much recent research has explored fault-tolerance mechanisms intended for current and future extreme-scale systems. Evaluations of the suitability of checkpoint-based solutions have typically been carried out using relatively uncomplicated computational kernels designed to measure floating point performance. More recent investigations have added scaled-down "proxy" applications to more closely match the composition and behavior of deployed ones. However, the information obtained from these studies (whether floating point performance or application runtime) is not necessarily of the most value in evaluating resilience strategies. We observe that even when using a more sophisticated metric, the information available from evaluating uncoordinated checkpointing using both microbenchmarks and proxy applications does not agree. This implies that not only might researchers be asking the wrong questions, but that the answers to the right ones might be unexpected and potentially misleading. We seek to open a discussion on whether benchmarks designed to provide predictable performance evaluations of HPC hardware and toolchains are providing the right feedback for the evaluation of fault-tolerance in these applications, and more generally on how benchmarking of resilience mechanisms ought to be approached in the exascale design space.

Original language: English
Title of host publication: Euro-Par 2013
Subtitle of host publication: Parallel Processing Workshops - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013, Revised Selected Papers
Publisher: Springer Verlag
Pages: 717-726
Number of pages: 10
ISBN (Print): 9783642544194
State: Published - 2014
Externally published: Yes
Event: 19th International Conference on Parallel Processing Workshops, Euro-Par 2013 - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013 - Aachen, Germany
Duration: Aug 26 2013 to Aug 27 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8374 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 19th International Conference on Parallel Processing Workshops, Euro-Par 2013 - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013
Country/Territory: Germany
City: Aachen
Period: 08/26/13 to 08/27/13
