TY - GEN
T1 - Dynamic meta-learning for failure prediction in large-scale systems
T2 - 37th International Conference on Parallel Processing, ICPP 2008
AU - Gu, Jiexing
AU - Zheng, Ziming
AU - Lan, Zhiling
AU - White, John
AU - Hocks, Eva
AU - Park, Byung Hoon
PY - 2008
Y1 - 2008
AB - Despite great efforts on the design of ultra-reliable components, the increase of system size and complexity has outpaced the improvement of component reliability. As a result, fault management becomes crucial in high performance computing. The advance of fault management relies on effective failure prediction. Despite years of research on failure prediction, it remains an open problem, especially in large-scale systems. In this paper, we address the problem by presenting a dynamic meta-learning prediction engine. It extends our previous work by exploring dynamic training, testing and prediction. Here, the "dynamic" part is from two perspectives: one is to continuously increase the training set during the system operation; and the other is to dynamically modify the rules of failure patterns by tracing prediction accuracy at runtime. Our case study indicates that the proposed predictor is promising by being capable of capturing more than 70% of failures, with the false alarm rate less than 10%.
UR - http://www.scopus.com/inward/record.url?scp=55849147399&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2008.17
DO - 10.1109/ICPP.2008.17
M3 - Conference contribution
AN - SCOPUS:55849147399
SN - 9780769533742
T3 - Proceedings of the International Conference on Parallel Processing
SP - 157
EP - 164
BT - Proceedings - 37th International Conference on Parallel Processing, ICPP 2008
Y2 - 9 September 2008 through 12 September 2008
ER -