TY - GEN
T1 - A tunable holistic resiliency approach for high-performance computing systems
AU - Scott, Stephen L.
AU - Engelmann, Christian
AU - Vallée, Geoffroy R.
AU - Naughton, Thomas
AU - Tikotekar, Anand
AU - Ostrouchov, George
AU - Leangsuksun, Chokchai
AU - Naksinehaboon, Nichamon
AU - Nassar, Raja
AU - Paun, Mihaela
AU - Mueller, Frank
AU - Wang, Chao
AU - Nagarajan, Arun B.
AU - Varma, Jyothish
PY - 2009
Y1 - 2009
N2 - In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
AB - In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
KW - Design
KW - Measurement
KW - Performance
KW - Reliability
UR - http://www.scopus.com/inward/record.url?scp=67650091156&partnerID=8YFLogxK
U2 - 10.1145/1504176.1504227
DO - 10.1145/1504176.1504227
M3 - Conference contribution
AN - SCOPUS:67650091156
SN - 9781605583976
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 305
EP - 306
BT - Proceedings of the 2009 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'09
T2 - 2009 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'09
Y2 - 14 February 2009 through 18 February 2009
ER -