A tunable holistic resiliency approach for high-performance computing systems

Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, Jyothish Varma

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

5 Scopus citations

Abstract

In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.

Original language: English
Title of host publication: Proceedings of the 2009 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'09
Pages: 305-306
Number of pages: 2
State: Published - 2009
Event: 2009 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'09 - Raleigh, NC, United States
Duration: Feb 14 2009 to Feb 18 2009

Publication series

Name: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP

Conference

Conference: 2009 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'09
Country/Territory: United States
City: Raleigh, NC
Period: 02/14/09 to 02/18/09

Keywords

  • Design
  • Measurement
  • Performance
  • Reliability
