TY - JOUR
T1 - A tunable holistic resiliency approach for high-performance computing systems
AU - Scott, Stephen L.
AU - Engelmann, Christian
AU - Vallée, Geoffroy R.
AU - Naughton, Thomas
AU - Tikotekar, Anand
AU - Ostrouchov, George
AU - Leangsuksun, Chokchai
AU - Naksinehaboon, Nichamon
AU - Nassar, Raja
AU - Paun, Mihaela
AU - Mueller, Frank
AU - Wang, Chao
AU - Nagarajan, Arun B.
AU - Varma, Jyothish
PY - 2009
Y1 - 2009
N2 - In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
AB - In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
UR - http://www.scopus.com/inward/record.url?scp=70350602896&partnerID=8YFLogxK
U2 - 10.1145/1594835.1504227
DO - 10.1145/1594835.1504227
M3 - Article
AN - SCOPUS:70350602896
SN - 1523-2867
VL - 44
SP - 305
EP - 306
JO - ACM SIGPLAN Notices
JF - ACM SIGPLAN Notices
IS - 4
ER -