A proactive fault tolerance framework for high-performance computing

Antonina Litvinova, Christian Engelmann, Stephen L. Scott

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

13 Scopus citations

Abstract

As high-performance computing (HPC) systems continue to increase in scale, their mean time to interrupt decreases correspondingly. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory, and I/O capabilities that do not grow proportionally, checkpoint/restart is becoming less efficient. Proactive FT avoids failures through preventative measures, such as migrating application parts away from nodes that are "about to fail". This paper presents a proactive FT framework that performs environmental monitoring, event logging, parallel job monitoring, and resource monitoring to analyze HPC system reliability and to perform FT through such preventative actions.
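The preventative action mentioned in the abstract, moving application parts away from nodes that are "about to fail", can be illustrated with a minimal sketch. This is not the framework described in the paper: the `NodeReading` type, the `plan_migrations` function, the temperature metric, and the 85 °C limit are all hypothetical names and values chosen for illustration. The sketch flags nodes whose reading crosses a threshold and pairs each one with the least-loaded healthy node.

```python
from dataclasses import dataclass

# Hypothetical limit; real thresholds depend on hardware and site policy.
TEMP_LIMIT_C = 85.0


@dataclass
class NodeReading:
    """One monitoring sample for a compute node (illustrative fields only)."""
    node: str
    temperature_c: float  # stand-in for any environmental metric
    running_ranks: int    # application parts currently placed on the node


def plan_migrations(readings, temp_limit_c=TEMP_LIMIT_C):
    """Return (source, target) pairs that move work off at-risk nodes.

    A node is treated as at risk when its temperature reaches the limit.
    Each at-risk node that still runs work is paired with the healthy node
    that currently carries the least load.
    """
    load = {r.node: r.running_ranks for r in readings}
    at_risk = [r.node for r in readings
               if r.temperature_c >= temp_limit_c and r.running_ranks > 0]
    healthy = [r.node for r in readings if r.temperature_c < temp_limit_c]

    plan = []
    for source in at_risk:
        if not healthy:
            break  # nowhere safe to migrate to
        target = min(healthy, key=lambda n: load[n])
        plan.append((source, target))
        # Account for the moved work so later choices see the new load.
        load[target] += load[source]
        load[source] = 0
    return plan


if __name__ == "__main__":
    sample = [
        NodeReading("n1", 91.0, 4),
        NodeReading("n2", 60.0, 2),
        NodeReading("n3", 55.0, 0),
    ]
    print(plan_migrations(sample))
```

A single threshold on one metric is the simplest possible trigger. A real deployment would presumably combine several monitored signals and would hand the resulting plan to whatever migration mechanism the system provides.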

Original language: English
Title of host publication: Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2010
Publisher: Acta Press
Pages: 105-110
Number of pages: 6
ISBN (Print): 9780889868205
State: Published - 2010
Event: 9th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2010 - Innsbruck, Austria
Duration: Feb 16 2010 to Feb 18 2010

Publication series

Name: Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2010

Conference

Conference: 9th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2010
Country/Territory: Austria
City: Innsbruck
Period: 02/16/10 to 02/18/10

Keywords

  • Fault tolerance
  • High availability
  • High-performance computing
  • Reliability
  • System monitoring
