Abstract
Parallelism provided by the GPU architecture has enabled domain scientists to simulate physical phenomena at a much faster rate and finer granularity than was previously possible with large-scale CPU-based clusters. Architecture researchers have been investigating the reliability characteristics of GPUs and devising techniques to increase the reliability of these emerging computing devices. Such efforts are often guided by technology projections and simplistic scientific kernels, and are performed using architectural simulators and modeling tools. The lack of large-scale field data limits the effectiveness of such efforts. This study attempts to bridge this gap by presenting a large-scale field data analysis of GPU reliability. We characterize and quantify different kinds of soft errors on the GPU nodes of the Titan supercomputer. Our study uncovers several interesting and previously unknown insights about the characteristics and impact of soft errors.
Original language | English |
---|---|
Title of host publication | Proceedings of the 2016 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2016 |
Publisher | IEEE Computer Society |
Pages | 519-530 |
Number of pages | 12 |
ISBN (Electronic) | 9781467392112 |
State | Published - Apr 1 2016 |
Event | 22nd IEEE International Symposium on High Performance Computer Architecture, HPCA 2016 - Barcelona, Spain; Duration: Mar 12 2016 → Mar 16 2016 |
Publication series
Name | Proceedings - International Symposium on High-Performance Computer Architecture |
---|---|
Volume | 2016-April |
ISSN (Print) | 1530-0897 |
Conference
Conference | 22nd IEEE International Symposium on High Performance Computer Architecture, HPCA 2016 |
---|---|
Country/Territory | Spain |
City | Barcelona |
Period | 03/12/16 → 03/16/16 |
Funding
This work used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC for the U.S. DOE under contract No. DE-AC05-00OR22725.