TY - GEN
T1 - Blue Gene/L log analysis and time to interrupt estimation
AU - Taerat, Narate
AU - Naksinehaboon, Nichamon
AU - Chandler, Clayton
AU - Elliott, James
AU - Leangsuksun, Chokchai
AU - Ostrouchov, George
AU - Scott, Stephen L.
AU - Engelmann, Christian
PY - 2009
Y1 - 2009
AB - System- and application-level failures could be characterized by analyzing relevant log files. The resulting data might then be used in numerous studies on and future developments for the mission-critical and large scale computational architecture, including fields such as failure prediction, reliability modeling, performance modeling and power awareness. In this paper, system logs covering a six month period of the Blue Gene/L supercomputer were obtained and subsequently analyzed. Temporal filtering was applied to remove duplicated log messages. Optimistic and pessimistic perspectives were exerted on filtered log information to observe failure behavior within the system. Further, various time to repair factors were applied to obtain application time to interrupt, which will be exploited in further resilience modeling research.
UR - http://www.scopus.com/inward/record.url?scp=70349657128&partnerID=8YFLogxK
U2 - 10.1109/ARES.2009.105
DO - 10.1109/ARES.2009.105
M3 - Conference contribution
AN - SCOPUS:70349657128
SN - 9780769535647
T3 - Proceedings - International Conference on Availability, Reliability and Security, ARES 2009
SP - 173
EP - 180
BT - Proceedings - International Conference on Availability, Reliability and Security, ARES 2009
T2 - International Conference on Availability, Reliability and Security, ARES 2009
Y2 - 16 March 2009 through 19 March 2009
ER -