Towards new metrics for high-performance computing resilience

Saurabh Hukerikar, Rizwan A. Ashraf, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

3 Scopus citations

Abstract

Ensuring the reliability of applications is becoming an increasingly important challenge as high-performance computing (HPC) systems experience an ever-growing number of faults, errors and failures. While the HPC community has made substantial progress in developing various resilience solutions, it continues to rely on platform-based metrics to quantify application resiliency improvements. The resilience of an HPC application is concerned with the reliability of the application outcome as well as the fault handling efficiency. To understand the scope of impact, effective coverage and performance efficiency of existing and emerging resilience solutions, there is a need for new metrics. In this paper, we develop new ways to quantify resilience that consider both the reliability and the performance characteristics of the solutions from the perspective of HPC applications. As HPC systems continue to evolve in terms of scale and complexity, it is expected that applications will experience various types of faults, errors and failures, which will require applications to apply multiple resilience solutions across the system stack. The proposed metrics are intended to be useful for understanding the combined impact of these solutions on an application's ability to produce correct results and to evaluate their overall impact on an application's performance in the presence of various modes of faults.

Original language: English
Title of host publication: FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017
Publisher: Association for Computing Machinery, Inc
Pages: 23-30
Number of pages: 8
ISBN (Electronic): 9781450350013
State: Published - Jun 26 2017
Event: 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017 - Washington, United States
Duration: Jun 26 2017 - Jun 30 2017

Publication series

Name: FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017

Conference

Conference: 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017
Country/Territory: United States
City: Washington
Period: 06/26/17 - 06/30/17

Funding

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC05-00OR22725.

Keywords

  • Fault tolerance
  • High performance computing
  • Metrics
  • Resilience
