Abstract
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.
Original language | English |
---|---|
Title of host publication | Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017 |
Publisher | Association for Computing Machinery, Inc |
ISBN (Electronic) | 9781450351140 |
DOIs | |
State | Published - Nov 12 2017 |
Event | International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017 - Denver, United States; Duration: Nov 12 2017 → Nov 17 2017 |
Publication series
Name | Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017 |
---|---|
Conference
Conference | International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017 |
---|---|
Country/Territory | United States |
City | Denver |
Period | 11/12/17 → 11/17/17 |
Funding
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).