TY - GEN
T1 - A large-scale study of soft-errors on GPUs in the field
AU - Nie, Bin
AU - Tiwari, Devesh
AU - Gupta, Saurabh
AU - Smirni, Evgenia
AU - Rogers, James H.
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/4/1
Y1 - 2016/4/1
N2 - Parallelism provided by the GPU architecture has enabled domain scientists to simulate physical phenomena at a much faster rate and finer granularity than what was previously possible by CPU-based large-scale clusters. Architecture researchers have been investigating reliability characteristics of GPUs and innovating techniques to increase the reliability of these emerging computing devices. Such efforts are often guided by technology projections and simplistic scientific kernels, and performed using architectural simulators and modeling tools. Lack of large-scale field data impedes the effectiveness of such efforts. This study attempts to bridge this gap by presenting a large-scale field data analysis of GPU reliability. We characterize and quantify different kinds of soft-errors on the Titan supercomputer's GPU nodes. Our study uncovers several interesting and previously unknown insights about the characteristics and impact of soft-errors.
UR - http://www.scopus.com/inward/record.url?scp=84965011474&partnerID=8YFLogxK
U2 - 10.1109/HPCA.2016.7446091
DO - 10.1109/HPCA.2016.7446091
M3 - Conference contribution
AN - SCOPUS:84965011474
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 519
EP - 530
BT - Proceedings of the 2016 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2016
PB - IEEE Computer Society
T2 - 22nd IEEE International Symposium on High Performance Computer Architecture, HPCA 2016
Y2 - 12 March 2016 through 16 March 2016
ER -