Abstract
GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.
Original language | English |
---|---|
Title of host publication | Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 95-106 |
Number of pages | 12 |
ISBN (Electronic) | 9781538655955 |
DOIs | |
State | Published - Jul 19 2018 |
Event | 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018 - Luxembourg City, Luxembourg. Duration: Jun 25 2018 → Jun 28 2018 |
Publication series
Name | Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018 |
---|
Conference
Conference | 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018 |
---|---|
Country/Territory | Luxembourg |
City | Luxembourg City |
Period | 06/25/18 → 06/28/18 |
Bibliographical note
Publisher Copyright: © 2018 IEEE.
Funding
Acknowledgment: We thank the reviewers for their constructive feedback. This work was supported in part by NSF grants CCF-1649087 and CCF-1717532, by Northeastern University, and by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell. This work also used, in part, the resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at ORNL, which is managed by UT-Battelle, LLC for the U.S. DOE under contract number DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Funders | Funder number |
---|---|
National Science Foundation | CCF-1649087, CCF-1717532 |
U.S. Department of Energy | DE-AC05-00OR22725 |
Office of Science | |
Advanced Scientific Computing Research | |
Northeastern University | |
Keywords
- Error Prediction
- GPU Reliability
- HPC
- Machine Learning
- System Reliability