Abstract
In May 2022, Frontier at Oak Ridge National Laboratory became the newest supercomputer to top the TOP500 list, demonstrating the capability to perform more than 1.1 quintillion (10^18) floating-point calculations every second. Driving this ground-breaking rate of computing are Frontier's more than 37,000 graphics processing units (GPUs) and 9,408 central processing units (CPUs). In total, Frontier contains more than 60 million parts. At this scale, even the smallest margin of error may generate hundreds of hardware errors across the system. If not found, these errors can directly hinder the world-class science performed on Frontier. In this work, we describe and evaluate two strategies for finding hardware-level faults in Frontier's 9,408 compute nodes. The first strategy uses the Slurm scheduler to scavenge available compute time to run the node screen; the second builds upon the lessons learned from the first and enforces a weekly screen of each node. Using June 2023 as a case study, we find that the first scheduling strategy consumed more than ten times the resources of the second, but successfully detected five hardware defects in Frontier. We summarize the lessons learned while developing and running a node screen on the world's first exascale supercomputer.
Original language | English |
---|---|
Title of host publication | Proceedings of 2023 SC Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023 |
Publisher | Association for Computing Machinery |
Pages | 619-626 |
Number of pages | 8 |
ISBN (Electronic) | 9798400707858 |
State | Published - Nov 12 2023 |
Event | 2023 International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023 - Denver, United States; Duration: Nov 12 2023 → Nov 17 2023 |
Publication series
Name | ACM International Conference Proceeding Series |
---|
Conference
Conference | 2023 International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023 |
---|---|
Country/Territory | United States |
City | Denver |
Period | 11/12/23 → 11/17/23 |
Funding
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. Notice of copyright: This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- Slurm
- hardware validation
- high-performance computing
- quality assurance