Abstract
The power and energy demands of HPC machines have grown significantly. Modern exascale HPC systems require tens of megawatts of combined power for computing resources and cooling facilities at full capacity. The current energy trend is not sustainable for future HPC systems, and there is a need to work toward the energy efficiency aspect of HPC performance. Energy awareness of the HPC applications at the job level is essential for running an efficient HPC system. This work aims to develop a pipeline to provide a production-level system-wide overview of the HPC workloads' power profile while handling evolving workloads exhibiting new power trends. We developed an open-set classification model for HPC jobs based on the properties of power profiles to continuously provide a system-wide holistic view of recently completed jobs. The pipeline helps continuously monitor the job-level power usage pattern of HPC and enables us to capture the new trends in applications' power behavior. We employed a comprehensive set of techniques to generate job-level data, custom-designed feature extraction methods to extract critical features from jobs' power profiles, clustering techniques powered by generative modeling, and open-set classification for assigning job profiles to known classes or an unknown set. With extensive evaluations, we demonstrate the effectiveness of each component in our pipeline. We provide an analysis of the resulting clusters that characterize the power profile landscape of the Summit supercomputer from more than 60K jobs executed in a year. The open-set classification classifies the known data sets into known classes with high accuracy and identifies unknown data points with over 85% accuracy.
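To make the idea of open-set classification concrete, the following is a minimal sketch, not the method used in the paper. It assumes two hypothetical features per job (mean power and standard deviation of power, in watts), one centroid per known class, and a hand-picked distance threshold beyond which a job is labeled unknown. All function names, class labels, and sample traces are illustrative assumptions.

```python
import math
from statistics import mean, pstdev

UNKNOWN = "unknown"


def extract_features(power_trace):
    """Summarize a job's power time series (watts) as (mean, std dev).

    Assumption: these two features stand in for the paper's custom
    feature extraction, which is not described in the abstract.
    """
    return (mean(power_trace), pstdev(power_trace))


def fit_centroids(labeled_traces):
    """Compute one feature-space centroid per known class.

    labeled_traces maps a class label to a list of power traces.
    """
    centroids = {}
    for label, traces in labeled_traces.items():
        feats = [extract_features(t) for t in traces]
        # Average each feature dimension across the class's jobs.
        centroids[label] = tuple(mean(dim) for dim in zip(*feats))
    return centroids


def classify_open_set(power_trace, centroids, threshold):
    """Return the nearest known class, or UNKNOWN if it is too far away."""
    feat = extract_features(power_trace)
    label, dist = min(
        ((lbl, math.dist(feat, c)) for lbl, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= threshold else UNKNOWN


# Hypothetical known classes with made-up power traces (watts).
known = {
    "steady_high": [[1900, 2000, 2100, 2000], [2000, 2050, 1950, 2000]],
    "steady_low": [[480, 500, 520, 500], [500, 510, 490, 500]],
}
centroids = fit_centroids(known)

print(classify_open_set([1980, 2020, 2000, 2000], centroids, threshold=150.0))
print(classify_open_set([500, 2000, 500, 2000], centroids, threshold=150.0))
```

In this toy setup the first job falls close to the `steady_high` centroid, while the second, a strongly oscillating trace, lies far from both centroids and is rejected as unknown. A closed-set classifier would instead force the second job into one of the two known classes.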
Original language | English |
---|---|
Title of host publication | Proceedings - 2024 IEEE 44th International Conference on Distributed Computing Systems, ICDCS 2024 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 93-104 |
Number of pages | 12 |
ISBN (Electronic) | 9798350386059 |
State | Published - 2024 |
Event | 44th IEEE International Conference on Distributed Computing Systems, ICDCS 2024 - Jersey City, United States; Duration: Jul 23 2024 → Jul 26 2024 |
Publication series
Name | Proceedings - International Conference on Distributed Computing Systems |
---|---|
ISSN (Print) | 1063-6927 |
ISSN (Electronic) | 2575-8411 |
Conference
Conference | 44th IEEE International Conference on Distributed Computing Systems, ICDCS 2024 |
---|---|
Country/Territory | United States |
City | Jersey City |
Period | 07/23/24 → 07/26/24 |
Funding
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Keywords
- HPC
- HPC Energy Profiling
- Job Power Profile Clustering
- Open-set and closed-set Classification