Long Term Per-Component Power and Thermal Measurements of the OLCF Summit System

Dataset

Description

As we move into the exascale era, the power and energy footprints of high-performance computing (HPC) systems have grown significantly larger. Under the harsh power and thermal conditions of such systems, components are exposed to extreme operating conditions. Operating modern HPC systems therefore requires deep insight into long-term system behavior to maintain both their efficiency and their longevity. To help the HPC community gain such insight, we provide a dataset that records the long-term power and thermal behavior of Summit, the 200 PF pre-exascale supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Summit is an IBM AC922-based system with 9,252 IBM Power9 CPUs and 27,756 Nvidia V100 GPUs, and it can consume up to 13 MW of power at peak. Heat is removed using medium-temperature direct liquid cooling and a secondary cooling loop based on rear-door heat exchangers. The dataset is derived from high-resolution (1 Hz) per-component (GPU and CPU) measurements taken on the system. We primarily provide 10-second and 1-minute mean power and thermal measurements selected from five month-long segments: January and August 2020, February and August 2021, and January 2022. For convenience, we also provide several sub-datasets randomly sampled across time and across the hosts of the cluster. Further details and example code for analysis can be found in the following GitHub repository: https://github.com/at-aaims/summit_power_and_thermal_data
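To illustrate what a 10-second or 1-minute mean of 1 Hz data looks like, the following is a minimal sketch that averages per-second samples into fixed windows. It is not the pipeline used to produce the dataset, which is not described here. The function name `window_means`, the `(timestamp, value)` sample layout, and the synthetic readings are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import fmean


def window_means(samples, window_s):
    """Average (timestamp_s, value) samples into fixed windows of window_s seconds.

    Returns a dict mapping each window's start time to the mean of its samples.
    """
    buckets = defaultdict(list)
    for t, value in samples:
        # Integer division assigns each sample to the window that contains it.
        buckets[(t // window_s) * window_s].append(value)
    return {start: fmean(values) for start, values in sorted(buckets.items())}


# Hypothetical 1 Hz power readings (watts) for one component over one minute.
# These values are synthetic and are not taken from the Summit dataset.
readings = [(t, 250.0 + (t % 10)) for t in range(60)]

ten_second = window_means(readings, 10)  # six windows, starting at 0, 10, ..., 50
one_minute = window_means(readings, 60)  # one window, starting at 0
```

Averaging reduces the data volume by a factor equal to the window length, at the cost of smoothing out power and temperature spikes shorter than the window.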

Cite this