Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility

Devesh Tiwari, Saurabh Gupta, George Gallarno, Jim Rogers, Don Maxwell

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

63 Scopus citations

Abstract

The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second-fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.

Original language: English
Title of host publication: Proceedings of SC 2015
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Electronic): 9781450337236
DOIs
State: Published - Nov 15 2015
Event: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 - Austin, United States
Duration: Nov 15 2015 to Nov 20 2015

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume: 15-20-November-2015
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015
Country/Territory: United States
City: Austin
Period: 11/15/15 to 11/20/15

Funding

We thank the reviewers for their feedback, which has significantly improved the paper. George Gallarno was supported by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the SULI Program. This work was supported by the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC for the U.S. DOE (under contract No. DE-AC05-00OR22725).

Funders (funder number)
Oak Ridge National Laboratory
U.S. DOE (DE-AC05-00OR22725)
U.S. Department of Energy
Battelle
Office of Science
Workforce Development for Teachers and Scientists
