GPU lifetimes on Titan supercomputer: Survival analysis and reliability

George Ostrouchov, Don Maxwell, Rizwan A. Ashraf, Christian Engelmann, Mallikarjun Shankar, James H. Rogers

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

20 Scopus citations

Abstract

The Cray XK7 Titan was the top supercomputer system in the world for a long time and remained critically important throughout its nearly seven-year life. It was an interesting machine from a reliability viewpoint, as most of its computing power came from 18,688 GPUs that had to undergo three rework cycles: two on the GPU mechanical assembly and one on the GPU circuit boards. We describe the last rework cycle and present a reliability analysis of over 100,000 cumulative years of GPU lifetimes during Titan's 6-year-long productive period. Using time-between-failures analysis and statistical survival analysis techniques, we find that GPU reliability depends on heat dissipation to an extent that strongly correlates with detailed nuances of the cooling architecture and job scheduling. We describe the history, data collection, cleaning, and analysis, and give recommendations for future supercomputing systems. We make the data and our analysis codes publicly available.

Original language: English
Title of host publication: Proceedings of SC 2020
Subtitle of host publication: International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Electronic): 9781728199986
State: Published - Nov 2020
Event: 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020 - Virtual, Atlanta, United States
Duration: Nov 9 2020 - Nov 19 2020

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume: 2020-November
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020
Country/Territory: United States
City: Virtual, Atlanta
Period: 11/9/20 - 11/19/20

Funding

This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Resilience for Extreme Scale Supercomputing Systems Program, with program managers Robinson Pino and Lucy Nowell.

Funders | Funder number
Advanced Scientific Computing Research | DE-AC05-00OR22725
U.S. Department of Energy
U.S. Department of Energy | DE-AC05-00OR22725
Office of Science

Keywords

• Cox regression
• Cray
• GPU
• GPU failure data set
• Kaplan-Meier survival
• MTBF
• NVIDIA
• large-scale systems
• log analysis
• reliability
• supercomputer
