Blue Gene/L log analysis and time to interrupt estimation

Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, James Elliott, Chokchai Leangsuksun, George Ostrouchov, Stephen L. Scott, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

18 Scopus citations

Abstract

System- and application-level failures can be characterized by analyzing the relevant log files. The resulting data can then be used in studies of, and future developments for, mission-critical and large-scale computational architectures, including fields such as failure prediction, reliability modeling, performance modeling and power awareness. In this paper, system logs covering a six-month period of the Blue Gene/L supercomputer were obtained and analyzed. Temporal filtering was applied to remove duplicated log messages. Optimistic and pessimistic perspectives were applied to the filtered log information to observe failure behavior within the system. Further, various time to repair factors were applied to obtain the application time to interrupt, which will be exploited in further resilience modeling research.
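The temporal filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual filtering rule, so everything below is an assumption made for illustration: the event tuple layout, the location names, the five-minute window, and the choice to compare each message against the most recent occurrence of the same (location, message) pair.

```python
from datetime import datetime, timedelta


def temporal_filter(events, window=timedelta(minutes=5)):
    """Drop repeated log messages that fall within `window` of the previous
    occurrence of the same (location, message) pair.

    `events` is an iterable of (timestamp, location, message) tuples sorted
    by timestamp. The tuple layout and the default window are assumptions
    for this sketch, not values taken from the paper.
    """
    last_seen = {}  # (location, message) -> timestamp of latest occurrence
    kept = []
    for ts, location, message in events:
        key = (location, message)
        prev = last_seen.get(key)
        # Keep the event only if it is the first of its kind or the gap
        # since the previous occurrence exceeds the window.
        if prev is None or ts - prev > window:
            kept.append((ts, location, message))
        # Always advance the marker so a long burst collapses to one event.
        last_seen[key] = ts
    return kept


# Hypothetical sample data; the location names and messages are made up.
t0 = datetime(2009, 1, 1, 0, 0, 0)
sample = [
    (t0, "node-a", "fatal error"),
    (t0 + timedelta(seconds=60), "node-a", "fatal error"),   # duplicate
    (t0 + timedelta(seconds=120), "node-b", "fatal error"),  # other location
    (t0 + timedelta(minutes=10), "node-a", "fatal error"),   # new occurrence
]
filtered = temporal_filter(sample)
print(len(filtered))
```

In this sketch the marker advances on every occurrence, so a continuous burst of identical messages is reduced to its first entry. Comparing against the first message of the burst instead would be an equally plausible variant.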

Original language: English
Title of host publication: Proceedings - International Conference on Availability, Reliability and Security, ARES 2009
Pages: 173-180
Number of pages: 8
State: Published - 2009
Event: International Conference on Availability, Reliability and Security, ARES 2009 - Fukuoka, Fukuoka Prefecture, Japan
Duration: Mar 16 2009 → Mar 19 2009

Publication series

Name: Proceedings - International Conference on Availability, Reliability and Security, ARES 2009

Conference

Conference: International Conference on Availability, Reliability and Security, ARES 2009
Country/Territory: Japan
City: Fukuoka, Fukuoka Prefecture
Period: 03/16/09 → 03/19/09
