Proactive fault tolerance using preemptive migration

C. Engelmann, G. R. Vallée, T. Naughton, S. L. Scott

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

80 Scopus citations

Abstract

Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.

Original language: English
Title of host publication: Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2009
Pages: 252-257
Number of pages: 6
State: Published - 2009
Event: 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2009 - Weimar, Germany
Duration: Feb 18 2009 to Feb 20 2009

Publication series

Name: Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2009

Conference

Conference: 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2009
Country/Territory: Germany
City: Weimar
Period: 02/18/09 to 02/20/09
