Abstract
Resilience is one of the key challenges in maintaining the high efficiency of future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings that continue to be valid, discover new findings, and discuss their implications.
Original language | English |
---|---|
Title of host publication | SC 2017 - International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | IEEE Computer Society |
ISBN (Electronic) | 9781450351140 |
State | Published - 2017 |
Event | 2017 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017 - Denver, United States. Duration: Nov 12 2017 → Nov 17 2017 |
Publication series
Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
---|---|
Volume | 2017-November |
ISSN (Print) | 2167-4329 |
ISSN (Electronic) | 2167-4337 |
Conference
Conference | 2017 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017 |
---|---|
Country/Territory | United States |
City | Denver |
Period | 11/12/17 → 11/17/17 |
Funding
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). We are thankful to reviewers for their constructive feedback that has helped us improve the quality of this paper. This material is based upon work supported by Global Resilience Institute at Northeastern University, and by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC05-00OR22725. Saurabh Gupta performed this work while employed at the Oak Ridge National Laboratory.