Failure prediction for HPC systems and applications: Current situation and open issues

Ana Gainaru, Franck Cappello, Marc Snir, William Kramer

Research output: Contribution to journal › Article › peer-review

27 Scopus citations

Abstract

As large-scale systems evolve towards post-petascale computing, it is crucial to provide fault-tolerance strategies that minimize the effects of faults on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and proactive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. One way of offering prediction is by analysing the system logs generated during production by large-scale systems. Current research in this field has a number of limitations that make the proposed methods unusable on real production high-performance computing (HPC) systems. Based on our observation that different failures have different distributions and behaviours, we propose a novel hybrid approach that combines signal analysis with data mining in order to overcome current limitations. We show that by analysing each event according to its specific behaviour, our prediction provides a precision of over 90% and is able to discover about 50% of all failures in a system, a result that allows its integration into proactive fault tolerance protocols.
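The two figures quoted in the abstract correspond to the usual precision and recall measures: precision is the fraction of issued predictions that match a real failure, and recall is the fraction of real failures that were predicted. The sketch below is only an illustration of those definitions, not the authors' evaluation code. The function name, the failure identifiers and the toy data are all hypothetical and are not taken from the paper.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two sets of failure identifiers.

    Both arguments are hypothetical: `predicted` holds the failures a
    predictor announced, `actual` holds the failures that really occurred.
    """
    true_positives = len(predicted & actual)
    # Precision: share of predictions that were correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real failures that were predicted.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Made-up toy data, for illustration only (not results from the paper).
actual = {"f1", "f2", "f3", "f4"}
predicted = {"f1", "f2", "x1"}

p, r = precision_recall(predicted, actual)
print(f"precision={p:.2f} recall={r:.2f}")
```

A predictor with high precision but moderate recall, as described in the abstract, raises few false alarms while still missing a share of the failures, which is why it is presented as a complement to checkpoint-restart rather than a replacement.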

Original language: English
Pages (from-to): 273-282
Number of pages: 10
Journal: International Journal of High Performance Computing Applications
Volume: 27
Issue number: 3
State: Published - Aug 2013
Externally published: Yes

Funding

This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (award number OCI 07-25070) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. This research was done in the context of the INRIA-Illinois Joint Laboratory for Petascale Computing. This work was also supported by the U.S. Department of Energy, Office of Science, under Contract No. DE-AC02-06CH11357.

Funders (funder number)
National Science Foundation: OCI 07-25070
U.S. Department of Energy
Office of Science: DE-AC02-06CH11357

    Keywords

    • failure prediction
    • fault tolerance
    • signal analysis
