Accurate fault prediction of Blue Gene/P RAS logs via geometric reduction

Joshua Thompson, David W. Dreisigmeyer, Terry Jones, Michael Kirby, Joshua Ladd

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

9 Scopus citations

Abstract

This investigation presents two distinct and novel approaches for predicting system failures on Oak Ridge National Laboratory's Blue Gene/P supercomputer. Each technique uses raw numeric and textual subsets of large data logs of physical system information, such as fan speeds and CPU temperatures. These data are used to develop models of the system capable of sensing anomalies, that is, deviations from nominal behavior. Each algorithm predicted anomalies reported in the event log in advance of their occurrence, and one algorithm did so without false positives. Both algorithms also predicted an anomaly that did not appear in the event log. The system administrator later confirmed that this fault, missing from the log but predicted by both algorithms, had in fact occurred.

Original language: English
Title of host publication: 2010 International Conference on Dependable Systems and Networks Workshops, DSN-W 2010
Pages: 8-14
Number of pages: 7
State: Published - 2010
Event: 2010 International Conference on Dependable Systems and Networks Workshops, DSN-W 2010 - Chicago, IL, United States
Duration: Jun 28 2010 to Jul 1 2010

Publication series

Name: Proceedings of the International Conference on Dependable Systems and Networks

Conference

Conference: 2010 International Conference on Dependable Systems and Networks Workshops, DSN-W 2010
Country/Territory: United States
City: Chicago, IL
Period: 06/28/10 to 07/1/10

Keywords

  • Fault prediction
  • High performance computing
  • MSET
  • NMF
  • Resiliency
