Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer

Seung Hwan Lim, Ross G. Miller, Sudharshan S. Vazhkudai

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Designing dependable supercomputers begins with an understanding of errors in real-world, large-scale systems. The Titan supercomputer at Oak Ridge National Laboratory provides a unique opportunity to investigate errors when an actual system is actively used by multiple concurrent users and workloads from diverse domains at varying scales. This study presents a thorough analysis of 6,908,497 hardware errors from 18,688 compute nodes of Titan for 312,215 user jobs over a 3-year period. Through careful joining of two system logs, the Machine Check Architecture (MCA) log and the job scheduler log, we show the correlated pattern of hardware errors for each job and user, in addition to individual descriptive statistics of errors, jobs, and users. Since the majority of hardware errors are memory errors, this study also shows the importance of error correction in memory systems.
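
The log join mentioned in the abstract can be pictured with a minimal sketch. It is not the authors' method: the record layouts (`Job`, `HwError`), their field names, and the matching rule (same node, error timestamp within the job's run interval) are all assumptions made for illustration, since the abstract does not describe the log formats or the join procedure.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical scheduler-log record (fields are assumptions)."""
    job_id: str
    user: str
    nodes: frozenset  # node ids allocated to the job
    start: float      # start time, seconds
    end: float        # end time, seconds (exclusive)


@dataclass(frozen=True)
class HwError:
    """Hypothetical hardware-error record (fields are assumptions)."""
    node: int
    timestamp: float
    kind: str         # e.g. "memory"


def errors_per_job(jobs, errors):
    """Count errors that occurred on a job's nodes while the job was running."""
    # Index jobs by node so each error is only compared against
    # jobs that actually ran on the node where it was logged.
    jobs_by_node = {}
    for job in jobs:
        for node in job.nodes:
            jobs_by_node.setdefault(node, []).append(job)

    counts = Counter()
    for err in errors:
        for job in jobs_by_node.get(err.node, []):
            if job.start <= err.timestamp < job.end:
                counts[job.job_id] += 1
    return counts


def errors_per_user(jobs, errors):
    """Aggregate the per-job counts up to the submitting user."""
    per_job = errors_per_job(jobs, errors)
    per_user = Counter()
    for job in jobs:
        per_user[job.user] += per_job[job.job_id]
    return per_user
```

Indexing jobs by node keeps the comparison local to each node; at the scale reported in the abstract, a real analysis would likely need a sorted or interval-based index rather than this linear scan per node.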

Original language: English
Title of host publication: Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 180-190
Number of pages: 11
ISBN (Electronic): 9781728168760
State: Published - May 2020
Event: 34th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2020 - New Orleans, United States
Duration: May 18 2020 to May 22 2020

Publication series

Name: Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020

Conference

Conference: 34th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2020
Country/Territory: United States
City: New Orleans
Period: 05/18/20 to 05/22/20

Funding

This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders: U.S. Department of Energy
