Machine learning models for GPU error prediction in a large scale HPC system

Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, Devesh Tiwari

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

59 Scopus citations

Abstract

GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.

Original language: English
Title of host publication: Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 95-106
Number of pages: 12
ISBN (Electronic): 9781538655955
State: Published - Jul 19 2018
Event: 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018 - Luxembourg City, Luxembourg
Duration: Jun 25 2018 to Jun 28 2018

Publication series

Name: Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018

Conference

Conference: 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
Country/Territory: Luxembourg
City: Luxembourg City
Period: 06/25/18 to 06/28/18

Bibliographical note

Publisher Copyright:
© 2018 IEEE.

Funding

Acknowledgment: We thank the reviewers for their constructive feedback. The work was supported in part through NSF grants CCF-1649087 and CCF-1717532, Northeastern University, and by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell. This work also used in part the resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at ORNL, which is managed by UT-Battelle, LLC for the U.S. DOE under contract number DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders (with funder numbers)
National Science Foundation: 1717532, CCF-1649087, CCF-1717532, 1649087
U.S. Department of Energy: DE-AC05-00OR22725
Office of Science
Advanced Scientific Computing Research
Northeastern University

    Keywords

    • Error Prediction
    • GPU Reliability
    • HPC
    • Machine Learning
    • System Reliability
