Nonparametric multivariate anomaly analysis in support of HPC resilience

G. Ostrouchov, T. Naughton, C. Engelmann, G. Vallée, S. L. Scott

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

4 Scopus citations

Abstract

Large-scale computing systems provide great potential for scientific exploration. However, the complexity that accompanies these enormous machines raises challenges for both users and operators. The effective use of such systems is often hampered by failures encountered when running applications on systems containing tens of thousands of nodes and hundreds of thousands of compute cores capable of yielding petaflops of performance. In systems of this size, failure detection is complicated and root-cause diagnosis is difficult. This paper describes our recent work on identifying anomalies in monitoring data and system logs to provide further insight into machine status, runtime behavior, failure modes, and failure root causes. It discusses the details of an initial prototype that gathers the data and uses statistical techniques for analysis.

Original language: English
Title of host publication: e-science 2009 - Proceedings of the 2009 5th IEEE International Conference on e-Science Workshops
Pages: 80-85
Number of pages: 6
State: Published - 2009
Event: 2009 5th IEEE International Conference on e-Science Workshops, e-science 2009 - Oxford, United Kingdom
Duration: Dec 9 2009 to Dec 11 2009

Publication series

Name: e-science 2009 - Proceedings of the 2009 5th IEEE International Conference on e-Science Workshops

Conference

Conference: 2009 5th IEEE International Conference on e-Science Workshops, e-science 2009
Country/Territory: United Kingdom
City: Oxford
Period: 12/9/09 to 12/11/09
