TY - GEN
T1 - Machine learning models for GPU error prediction in a large scale HPC system
AU - Nie, Bin
AU - Xue, Ji
AU - Gupta, Saurabh
AU - Patel, Tirthak
AU - Engelmann, Christian
AU - Smirni, Evgenia
AU - Tiwari, Devesh
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/7/19
Y1 - 2018/7/19
N2 - GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.
KW - Error Prediction
KW - GPU Reliability
KW - HPC
KW - Machine Learning
KW - System Reliability
UR - http://www.scopus.com/inward/record.url?scp=85051090091&partnerID=8YFLogxK
U2 - 10.1109/DSN.2018.00022
DO - 10.1109/DSN.2018.00022
M3 - Conference contribution
AN - SCOPUS:85051090091
T3 - Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
SP - 95
EP - 106
BT - Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
Y2 - 25 June 2018 through 28 June 2018
ER -