TY - GEN
T1 - 3-Dimensional root cause diagnosis via co-analysis
AU - Zheng, Ziming
AU - Yu, Li
AU - Lan, Zhiling
AU - Jones, Terry
PY - 2012
Y1 - 2012
N2 - With the growth of system size and complexity, reliability has become a major concern for large-scale systems. Upon the occurrence of a failure, system administrators typically trace the events in Reliability, Availability, and Serviceability (RAS) logs for root cause diagnosis. However, RAS logs contain only limited diagnostic information. Moreover, manual processing is time-consuming, error-prone, and not scalable. To address the problem, in this paper we present an automated root cause diagnosis mechanism for large-scale HPC systems. Our mechanism examines multiple logs to provide a 3-D fine-grained root cause analysis. Here, 3-D means that our analysis pinpoints the failure layer, the time, and the location of the event that causes the problem. We evaluate our mechanism by means of real logs collected from a production IBM Blue Gene/P system at Oak Ridge National Laboratory. It successfully identifies failure layer information for the failures that occurred during a 23-month period. Furthermore, it effectively identifies the triggering events with time and location information, even when the triggering events occur hundreds of hours before the resulting failures.
KW - Co-analysis
KW - Diagnosis
KW - Large-scale system
UR - http://www.scopus.com/inward/record.url?scp=84867695274&partnerID=8YFLogxK
U2 - 10.1145/2371536.2371571
DO - 10.1145/2371536.2371571
M3 - Conference contribution
AN - SCOPUS:84867695274
SN - 9781450315203
T3 - ICAC'12 - Proceedings of the 9th ACM International Conference on Autonomic Computing
SP - 181
EP - 190
BT - ICAC'12 - Proceedings of the 9th ACM International Conference on Autonomic Computing
T2 - 9th ACM International Conference on Autonomic Computing, ICAC'12
Y2 - 18 September 2012 through 20 September 2012
ER -