TY - GEN
T1 - Power Profile Monitoring and Tracking Evolution of System-Wide HPC Workloads
AU - Karimi, Ahmad Maroof
AU - Sattar, Naw Safrin
AU - Shin, Woong
AU - Wang, Feiyi
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - The power & energy demands of HPC machines have grown significantly. Modern exascale HPC systems require tens of megawatts of combined power for computing resources and cooling facilities at full capacity. The current energy trend is not sustainable for future HPC systems, and there is a need to work toward the energy efficiency aspect of HPC performance. Energy awareness of the HPC applications at the job level is essential for running an efficient HPC system. This work aims to develop a pipeline to provide a production-level system-wide overview of the HPC workloads' power profile while handling evolving workloads exhibiting new power trends. We developed an open-set classification model for HPC jobs based on the properties of power profiles to continuously provide a system-wide holistic view of recently completed jobs. The pipeline helps continuously monitor the job-level power usage pattern of HPC and enables us to capture the new trends in applications' power behavior. We employed a comprehensive set of techniques to generate job-level data, custom-designed feature extraction methods to extract critical features from jobs' power profiles, clustering techniques powered by generative modeling, and open-set classification for identifying job profiles into known classes or an unknown set. With extensive evaluations, we demonstrate the effectiveness of each component in our pipeline. We provide an analysis of the resulting clusters that characterize the power profile landscape of the Summit supercomputer from more than 60K jobs executed in a year. The open-set classification classifies the known data sets into known classes with high accuracy and identifies unknown data points with over 85% accuracy.
AB - The power & energy demands of HPC machines have grown significantly. Modern exascale HPC systems require tens of megawatts of combined power for computing resources and cooling facilities at full capacity. The current energy trend is not sustainable for future HPC systems, and there is a need to work toward the energy efficiency aspect of HPC performance. Energy awareness of the HPC applications at the job level is essential for running an efficient HPC system. This work aims to develop a pipeline to provide a production-level system-wide overview of the HPC workloads' power profile while handling evolving workloads exhibiting new power trends. We developed an open-set classification model for HPC jobs based on the properties of power profiles to continuously provide a system-wide holistic view of recently completed jobs. The pipeline helps continuously monitor the job-level power usage pattern of HPC and enables us to capture the new trends in applications' power behavior. We employed a comprehensive set of techniques to generate job-level data, custom-designed feature extraction methods to extract critical features from jobs' power profiles, clustering techniques powered by generative modeling, and open-set classification for identifying job profiles into known classes or an unknown set. With extensive evaluations, we demonstrate the effectiveness of each component in our pipeline. We provide an analysis of the resulting clusters that characterize the power profile landscape of the Summit supercomputer from more than 60K jobs executed in a year. The open-set classification classifies the known data sets into known classes with high accuracy and identifies unknown data points with over 85% accuracy.
KW - HPC
KW - HPC Energy Profiling
KW - Job Power Profile Clustering
KW - Open-set and closed-set Classification
UR - http://www.scopus.com/inward/record.url?scp=85203151534&partnerID=8YFLogxK
U2 - 10.1109/ICDCS60910.2024.00018
DO - 10.1109/ICDCS60910.2024.00018
M3 - Conference contribution
AN - SCOPUS:85203151534
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 93
EP - 104
BT - Proceedings - 2024 IEEE 44th International Conference on Distributed Computing Systems, ICDCS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Distributed Computing Systems, ICDCS 2024
Y2 - 23 July 2024 through 26 July 2024
ER -