Failures in large scale systems: Long-term measurement, analysis, and implications

Saurabh Gupta, Tirthak Patel, Christian Engelmann, Devesh Tiwari

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

85 Scopus citations

Abstract

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.

Original language: English
Title of host publication: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450351140
State: Published - Nov 12 2017
Event: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017 - Denver, United States
Duration: Nov 12 2017 → Nov 17 2017

Publication series

Name: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017

Conference

Conference: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017
Country/Territory: United States
City: Denver
Period: 11/12/17 → 11/17/17

Funding

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
