Abstract
The Cray XK7 Titan was the top supercomputer system in the world for a long time and remained critically important throughout its nearly seven-year life. It was an interesting machine from a reliability viewpoint, as most of its power came from 18,688 GPUs that were forced to undergo three rework cycles, two on the GPU mechanical assembly and one on the GPU circuit boards. We describe the last rework cycle and present a reliability analysis of over 100,000 years of GPU lifetimes during Titan's 6-year-long productive period. Using time-between-failures analysis and statistical survival analysis techniques, we find that GPU reliability depends on heat dissipation to an extent that strongly correlates with detailed nuances of the cooling architecture and job scheduling. We describe the history, data collection, cleaning, and analysis, and give recommendations for future supercomputing systems. We make the data and our analysis codes publicly available.
Original language | English |
---|---|
Title of host publication | Proceedings of SC 2020 |
Subtitle of host publication | International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | IEEE Computer Society |
ISBN (Electronic) | 9781728199986 |
State | Published - Nov 2020 |
Event | 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020 - Virtual, Atlanta, United States; Duration: Nov 9 2020 → Nov 19 2020 |
Publication series
Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
---|---|
Volume | 2020-November |
ISSN (Print) | 2167-4329 |
ISSN (Electronic) | 2167-4337 |
Conference
Conference | 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020 |
---|---|
Country/Territory | United States |
City | Virtual, Atlanta |
Period | 11/9/20 → 11/19/20 |
Funding
This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Resilience for Extreme Scale Supercomputing Systems Program, with program managers Robinson Pino and Lucy Nowell.
Funders | Funder number |
---|---|
Advanced Scientific Computing Research | DE-AC05-00OR22725 |
U.S. Department of Energy | DE-AC05-00OR22725 |
Office of Science | |
Keywords
- Cox regression
- Cray
- GPU
- GPU failure data set
- Kaplan-Meier survival
- MTBF
- NVIDIA
- large-scale systems
- log analysis
- reliability
- supercomputer
Datasets
- GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability
  Shankar, M. (Creator), Ostrouchov, G. (Creator), Maxwell, D. (Creator), Rogers, J. (Creator), Ashraf, R. A. (Creator) & Engelmann, C. (Creator), Constellation by Oak Ridge Leadership Computing Facility (OLCF), Sep 2 2020
  DOI: 10.13139/ORNLNCCS/1657202
  Dataset
- SMC 2021 Data Challenge: Analyzing Resource Utilization and User Behavior on Titan Supercomputer
  Dash, S. (Creator), Paul, A. K. (Creator), Oral, S. (Creator) & Wang, F. (Creator), Constellation by Oak Ridge Leadership Computing Facility (OLCF), Mar 29 2021
  Dataset