Dynamic meta-learning for failure prediction in large-scale systems: A case study

Jiexing Gu, Ziming Zheng, Zhiling Lan, John White, Eva Hocks, Byung Hoon Park

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

48 Scopus citations

Abstract

Despite great efforts in the design of ultra-reliable components, the increase in system size and complexity has outpaced the improvement in component reliability. As a result, fault management has become crucial in high performance computing. Advances in fault management rely on effective failure prediction. Despite years of research, failure prediction remains an open problem, especially in large-scale systems. In this paper, we address the problem by presenting a dynamic meta-learning prediction engine. It extends our previous work by exploring dynamic training, testing and prediction. Here, "dynamic" has two aspects: the training set is continuously enlarged during system operation, and the rules describing failure patterns are dynamically modified by tracking prediction accuracy at runtime. Our case study indicates that the proposed predictor is promising: it captures more than 70% of failures with a false alarm rate below 10%.
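The two "dynamic" aspects described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `Rule` and `DynamicPredictor` classes, the event names, and the `min_precision` and `min_trials` thresholds are all hypothetical, chosen only to show a training set that grows at runtime and rules that are dropped once their observed accuracy falls too low.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical failure-pattern rule: fires when all of its events appear."""
    events: frozenset
    hits: int = 0    # warnings that were followed by a real failure
    misses: int = 0  # warnings that were not followed by a failure

    def fires(self, window):
        return self.events <= set(window)

    def precision(self):
        total = self.hits + self.misses
        return self.hits / total if total else 1.0


class DynamicPredictor:
    """Illustrative only; thresholds are assumptions, not values from the paper."""

    def __init__(self, rules, min_precision=0.5, min_trials=3):
        self.rules = list(rules)
        self.training_set = []
        self.min_precision = min_precision
        self.min_trials = min_trials

    def predict(self, window):
        # Warn if any surviving rule matches the current event window.
        return any(rule.fires(window) for rule in self.rules)

    def observe(self, window, failed):
        # Aspect 1: the training set keeps growing during operation.
        self.training_set.append((tuple(window), failed))
        # Aspect 2: track each rule's accuracy and drop rules that do poorly.
        for rule in self.rules:
            if rule.fires(window):
                if failed:
                    rule.hits += 1
                else:
                    rule.misses += 1
        self.rules = [
            rule for rule in self.rules
            if rule.hits + rule.misses < self.min_trials
            or rule.precision() >= self.min_precision
        ]
```

In this sketch a rule is only judged after it has fired `min_trials` times, so a single early false alarm does not remove it. A fuller system would presumably also retrain on the enlarged training set to generate new rules, which is omitted here.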

Original language: English
Title of host publication: Proceedings - 37th International Conference on Parallel Processing, ICPP 2008
Pages: 157-164
Number of pages: 8
State: Published - 2008
Event: 37th International Conference on Parallel Processing, ICPP 2008 - Portland, OR, United States
Duration: Sep 9 2008 - Sep 12 2008

Publication series

Name: Proceedings of the International Conference on Parallel Processing
ISSN (Print): 0190-3918

Conference

Conference: 37th International Conference on Parallel Processing, ICPP 2008
Country/Territory: United States
City: Portland, OR
Period: 09/9/08 - 09/12/08
