Fault injection framework for system resilience evaluation: Fake faults for finding future failures

Thomas Naughton, Wesley Bland, Geoffroy Vallée, Christian Engelmann, Stephen L. Scott

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

23 Scopus citations

Abstract

As high-performance computing (HPC) systems increase in size and complexity, they become more difficult to manage. The enormous component counts associated with these large systems lead to significant challenges in system reliability and availability. This in turn is driving research into the resilience of large-scale systems, which seeks to curb the effects of increased failures at large scales by masking the inevitable faults in these systems. The basic premise is that failure must be accepted as a reality of large-scale systems and coped with accordingly through system resilience. A key component in the development and evaluation of system resilience techniques is a means to conduct controlled experiments. A common method for performing such experiments is to generate synthetic faults and study the resulting effects. In this paper, we discuss the motivation for, and our initial use of, software fault injection to support the evaluation of resilience for HPC systems. We outline background and related work in the area and discuss the design of a tool to aid in fault injection experiments for both user-space (application-level) and system-level failures.

Original language: English
Title of host publication: Proceedings of the 2009 Workshop on Resiliency in High Performance, Resilience'09, Co-located with the 2009 International Symposium on High Performance Distributed Computing Conference, HPDC'09
Pages: 23-28
Number of pages: 6
State: Published - 2009
Event: 2009 Workshop on Resiliency in High Performance, Resilience'09, Co-located with the 2009 International Symposium on High Performance Distributed Computing Conference, HPDC'09 - Garching, Germany
Duration: Jun 9 2009 → Jun 9 2009

Publication series

Name: Proceedings of the 2009 Workshop on Resiliency in High Performance, Resilience'09, Co-located with the 2009 International Symposium on High Performance Distributed Computing Conference, HPDC'09

Conference

Conference: 2009 Workshop on Resiliency in High Performance, Resilience'09, Co-located with the 2009 International Symposium on High Performance Distributed Computing Conference, HPDC'09
Country/Territory: Germany
City: Garching
Period: 06/9/09 → 06/9/09

Keywords

  • Fault injection
  • Resilience
