Power Profile Monitoring and Tracking Evolution of System-Wide HPC Workloads

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

The power and energy demands of HPC machines have grown significantly. Modern exascale HPC systems require tens of megawatts of combined power for computing resources and cooling facilities at full capacity. This energy trend is not sustainable for future HPC systems, so energy efficiency must be treated as part of HPC performance. Job-level energy awareness of HPC applications is essential for running an efficient HPC system. This work develops a pipeline that provides a production-level, system-wide overview of HPC workloads' power profiles while handling evolving workloads that exhibit new power trends. We developed an open-set classification model for HPC jobs, based on the properties of their power profiles, to continuously provide a holistic, system-wide view of recently completed jobs. The pipeline continuously monitors job-level power usage patterns and captures new trends in applications' power behavior. It combines a comprehensive set of techniques for generating job-level data, custom-designed feature extraction methods that extract critical features from jobs' power profiles, clustering techniques powered by generative modeling, and open-set classification that assigns job profiles to known classes or to an unknown set. With extensive evaluations, we demonstrate the effectiveness of each component of the pipeline. We analyze the resulting clusters, which characterize the power profile landscape of the Summit supercomputer from more than 60K jobs executed in a year. The open-set classifier assigns known data points to their known classes with high accuracy and identifies unknown data points with over 85% accuracy.
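To make the idea of open-set classification on power profiles concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hand-picked feature set (mean, standard deviation, peak, and mean step-to-step swing of a job's power trace) and a simple per-class distance threshold in place of the paper's feature extraction, generative clustering, and classifier, none of which are specified in the abstract. All names (`extract_features`, `OpenSetClassifier`, `max_rms_z`) and the synthetic traces are hypothetical.

```python
import numpy as np

UNKNOWN = -1  # label for profiles that match no known class (assumed convention)


def extract_features(power_watts):
    """Summarize one job's power time series as a small feature vector."""
    p = np.asarray(power_watts, dtype=float)
    swing = np.abs(np.diff(p)).mean() if p.size > 1 else 0.0
    return np.array([p.mean(), p.std(), p.max(), swing])


class OpenSetClassifier:
    """Per-class diagonal Gaussian with a rejection threshold.

    A point is assigned to the closest known class, unless its RMS
    z-score to that class exceeds max_rms_z, in which case it is
    labeled UNKNOWN.
    """

    def __init__(self, max_rms_z=3.0):
        self.max_rms_z = max_rms_z

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        # Small epsilon avoids division by zero for constant features.
        self.stds_ = np.stack([X[y == c].std(axis=0) for c in self.classes_]) + 1e-9
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # z has shape (n_samples, n_classes, n_features).
        z = (X[:, None, :] - self.means_[None, :, :]) / self.stds_[None, :, :]
        dist = np.sqrt((z ** 2).mean(axis=2))
        nearest = dist.argmin(axis=1)
        known = dist[np.arange(len(X)), nearest] <= self.max_rms_z
        return np.where(known, self.classes_[nearest], UNKNOWN)


def steady(level, ripple, n=200):
    """Synthetic trace: constant power level with a small sinusoidal ripple."""
    t = np.arange(n)
    return level + ripple * np.sin(2 * np.pi * t / 20)


# Two known classes: low-power and high-power steady jobs (synthetic data).
train = [steady(480 + 2 * i, 10 + i) for i in range(20)]
train += [steady(1960 + 4 * i, 10 + i) for i in range(20)]
labels = [0] * 20 + [1] * 20
model = OpenSetClassifier().fit([extract_features(p) for p in train], labels)

# A job that alternates between low and high power matches neither class.
bursty = np.where(np.arange(200) % 40 < 20, 2000.0, 500.0)
queries = [steady(500, 20), steady(2000, 15), bursty]
predictions = model.predict([extract_features(p) for p in queries])
print(predictions.tolist())
```

In this sketch the two steady traces fall within the threshold of a known class, while the bursty trace is far from both and is sent to the unknown set. In a monitoring pipeline of the kind the abstract describes, profiles accumulating in the unknown set would be the natural candidates for re-clustering into new classes.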

Original language: English
Title of host publication: Proceedings - 2024 IEEE 44th International Conference on Distributed Computing Systems, ICDCS 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 93-104
Number of pages: 12
ISBN (Electronic): 9798350386059
DOIs
State: Published - 2024
Event: 44th IEEE International Conference on Distributed Computing Systems, ICDCS 2024 - Jersey City, United States
Duration: Jul 23 2024 to Jul 26 2024

Publication series

Name: Proceedings - International Conference on Distributed Computing Systems
ISSN (Print): 1063-6927
ISSN (Electronic): 2575-8411

Conference

Conference: 44th IEEE International Conference on Distributed Computing Systems, ICDCS 2024
Country/Territory: United States
City: Jersey City
Period: 07/23/24 to 07/26/24

Keywords

  • HPC
  • HPC Energy Profiling
  • Job Power Profile Clustering
  • Open-set and closed-set Classification
