Towards a model to estimate the reliability of large-scale hybrid supercomputers

Elvis Rojas, Esteban Meneses, Terry Jones, Don Maxwell

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Supercomputers stand as a fundamental tool for developing our understanding of the universe. State-of-the-art scientific simulations, big data analyses, and machine learning executions require high performance computing platforms. Such infrastructures have been growing lately with the addition of thousands of newly designed components, calling their resiliency into question. It is crucial to solidify our knowledge on the way supercomputers fail. Other recent studies have highlighted the importance of characterizing failures on supercomputers. This paper aims at modelling component failures of a supercomputer based on Mixed Weibull distributions. The model is built using a real-life multi-year failure record from a leadership-class supercomputer. Using several key observations from the data, we designed an analytical model that is robust enough to represent each of the main components of supercomputers, yet it is flexible enough to alter the composition of the machine and be able to predict resilience of future or hypothetical systems.

Original language: English
Title of host publication: Euro-Par 2020
Subtitle of host publication: Parallel Processing - 26th International Conference on Parallel and Distributed Computing, Proceedings
Editors: Maciej Malawski, Krzysztof Rzadca
Publisher: Springer
Pages: 37-51
Number of pages: 15
ISBN (Print): 9783030576745
DOIs
State: Published - 2020
Event: 26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020 - Warsaw, Poland
Duration: Aug 24 2020 → Aug 28 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12247 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020
Country/Territory: Poland
City: Warsaw
Period: 08/24/20 → 08/28/20

Funding

Notice: This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Acknowledgment. This research was partially supported by a machine allocation on the Kabré supercomputer at the Costa Rica National High Technology Center. Early versions of this manuscript received valuable comments from Prof. Marcela Alfaro-Cordoba at the University of Costa Rica.

Keywords

  • Failure analysis
  • Failure modelling
  • Fault tolerance
  • Resilience
