A large-scale study of soft-errors on GPUs in the field

Bin Nie, Devesh Tiwari, Saurabh Gupta, Evgenia Smirni, James H. Rogers

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

73 Scopus citations

Abstract

Parallelism provided by the GPU architecture has enabled domain scientists to simulate physical phenomena at a much faster rate and finer granularity than what was previously possible by CPU-based large-scale clusters. Architecture researchers have been investigating reliability characteristics of GPUs and innovating techniques to increase the reliability of these emerging computing devices. Such efforts are often guided by technology projections and simplistic scientific kernels, and performed using architectural simulators and modeling tools. Lack of large-scale field data impedes the effectiveness of such efforts. This study attempts to bridge this gap by presenting a large-scale field data analysis of GPU reliability. We characterize and quantify different kinds of soft-errors on the Titan supercomputer's GPU nodes. Our study uncovers several interesting and previously unknown insights about the characteristics and impact of soft-errors.

Original language: English
Title of host publication: Proceedings of the 2016 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2016
Publisher: IEEE Computer Society
Pages: 519-530
Number of pages: 12
ISBN (Electronic): 9781467392112
State: Published - Apr 1 2016
Event: 22nd IEEE International Symposium on High Performance Computer Architecture, HPCA 2016 - Barcelona, Spain
Duration: Mar 12 2016 to Mar 16 2016

Publication series

Name: Proceedings - International Symposium on High-Performance Computer Architecture
Volume: 2016-April
ISSN (Print): 1530-0897

Conference

Conference: 22nd IEEE International Symposium on High Performance Computer Architecture, HPCA 2016
Country/Territory: Spain
City: Barcelona
Period: 03/12/16 to 03/16/16

Funding

This work also used the resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC for the U.S. DOE (under contract No. DE-AC05-00OR22725).

Funders: Funder number
U.S. DOE: DE-AC05-00OR22725
National Science Foundation: 1218758
