Abstract
As we approach the exascale computing era, a focused understanding of power consumption and the constraints it places on HPC architectures and applications is becoming increasingly important. Summit, located at the Oak Ridge Leadership Computing Facility (OLCF), is one of the fastest and largest pre-exascale platforms in operation today. This paper provides a first-order examination and analysis of power consumption at the component, node, and system levels, using data from all 4,626 Summit compute nodes, each reporting over 100 metrics at 1 Hz, over the entire year of 2020. We also investigate the power characteristics and energy efficiency of over 840k Summit jobs, along with 250k GPU failure logs, for further operational insights. To the best of our knowledge, this is the first systematic analysis of power data from an HPC system at this scale.
Original language | English |
---|---|
Title of host publication | Proceedings of SC 2021 |
Subtitle of host publication | The International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond |
Publisher | IEEE Computer Society |
ISBN (Electronic) | 9781450384421 |
State | Published - Nov 14 2021 |
Event | 33rd International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond, SC 2021 - Virtual, Online, United States; Duration: Nov 14 2021 → Nov 19 2021 |
Publication series
Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
---|---|
ISSN (Print) | 2167-4329 |
ISSN (Electronic) | 2167-4337 |
Conference
Conference | 33rd International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond, SC 2021 |
---|---|
Country/Territory | United States |
City | Virtual, Online |
Period | 11/14/21 → 11/19/21 |
Funding
This work was supported by, and used the resources of, the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at ORNL, which is managed by UT-Battelle, LLC for the U.S. DOE (under contract No. DE-AC05-00OR22725). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- Data analysis
- Energy
- GPU
- HPC
- Power
- Reliability
- Telemetry