TY - GEN
T1 - A proactive fault tolerance framework for high-performance computing
AU - Litvinova, Antonina
AU - Engelmann, Christian
AU - Scott, Stephen L.
PY - 2010
Y1 - 2010
N2 - As high-performance computing (HPC) systems continue to increase in scale, their mean time to interrupt decreases correspondingly. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory, and I/O capabilities that are not increasing proportionally, checkpoint/restart is becoming less efficient. Proactive FT avoids failures through preventative measures, such as migrating application parts away from nodes that are "about to fail". This paper presents a proactive FT framework that performs environmental monitoring, event logging, parallel job monitoring, and resource monitoring to analyze HPC system reliability and to perform FT through such preventative actions.
AB - As high-performance computing (HPC) systems continue to increase in scale, their mean time to interrupt decreases correspondingly. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory, and I/O capabilities that are not increasing proportionally, checkpoint/restart is becoming less efficient. Proactive FT avoids failures through preventative measures, such as migrating application parts away from nodes that are "about to fail". This paper presents a proactive FT framework that performs environmental monitoring, event logging, parallel job monitoring, and resource monitoring to analyze HPC system reliability and to perform FT through such preventative actions.
KW - Fault tolerance
KW - High availability
KW - High-performance computing
KW - Reliability
KW - System monitoring
UR - http://www.scopus.com/inward/record.url?scp=77954589337&partnerID=8YFLogxK
U2 - 10.2316/p.2010.676-024
DO - 10.2316/p.2010.676-024
M3 - Conference contribution
AN - SCOPUS:77954589337
SN - 9780889868205
T3 - Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2010
SP - 105
EP - 110
BT - Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2010
PB - Acta Press
T2 - 9th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2010
Y2 - 16 February 2010 through 18 February 2010
ER -