Fine-Grained Application Energy and Power Measurements on the Frontier Exascale System

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

The increasing complexity and power/energy demands of heterogeneous exascale systems, such as the Frontier supercomputer, present significant challenges for measuring and optimizing power consumption in applications. Current tools either lack the resolution to capture fine-grained power and energy measurements, fail to validate in-band measurements against out-of-band power sensors, or cannot integrate this information with application performance events in a scalable manner. This paper introduces a novel open-source performance toolkit that integrates extended PAPI components with Score-P plugins to enable in-band, fine-grained power and energy measurements, while also supporting validation using power meter measurements for both CPUs and GPUs. One key contribution is the ability to perform millisecond-level power and energy measurements for AMD MI250X GPUs, mapping them to application performance events within a single trace and measurement system that scales. Our toolkit combines coarse-grained measurements from cray_pm counters with high-resolution metrics from rocm_smi and RAPL, converting GPU instantaneous accumulated energy into power to capture both transient and steady-state power behavior, a capability often missed by out-of-band and monitoring tools. By mapping these metrics to specific application regions, developers can identify energy hotspots, address inefficiencies in GPU kernel execution, and validate in-band measurements against external measurements. We demonstrate the effectiveness of this approach through case studies using benchmarks such as GPU rocblas_sgemm, BLIS c_blas_dgemm, and rocHPL, highlighting the variability of the measurements and the impact of transient power spikes on kernel-level efficiency.
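The conversion of accumulated energy into power mentioned in the abstract can be illustrated with a minimal sketch. It assumes the energy counter has already been sampled into (timestamp in seconds, accumulated energy in joules) pairs. The function name `energy_to_power` and the sample values are hypothetical and are not part of the toolkit, PAPI, Score-P, or rocm_smi. Average power over each sampling interval is the energy difference divided by the time difference.

```python
from typing import List, Tuple

# Hypothetical helper, not the toolkit's API: turns samples of an
# accumulated energy counter into average power per sampling interval.
def energy_to_power(
    samples: List[Tuple[float, float]]
) -> List[Tuple[float, float]]:
    """Map (time_s, accumulated_energy_J) samples to (time_s, power_W).

    Each output point is stamped with the end time of its interval and
    holds the average power over that interval: (E1 - E0) / (t1 - t0).
    """
    powers = []
    for (t0, e0), (t1, e1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            # Skip duplicate or out-of-order timestamps instead of dividing by zero.
            continue
        powers.append((t1, (e1 - e0) / dt))
    return powers


# Illustrative (made-up) samples taken at 1 ms intervals.
samples = [(0.000, 100.0), (0.001, 100.5), (0.002, 101.0), (0.003, 101.25)]
trace = energy_to_power(samples)
```

Shorter sampling intervals expose brief power spikes that a coarser average would smooth out, which is why millisecond-level sampling matters for kernel-level analysis.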

Original language: English
Title of host publication: Proceedings of CUG 2025 - Cray User Group Conference
Editors: Ashley Barker, Bilel Hadri, Colleen Bertoni, Nick Hagerty, Timothy W. Robinson
Publisher: Association for Computing Machinery, Inc
Pages: 135-146
Number of pages: 12
ISBN (Electronic): 9798400713279
DOIs
State: Published - Nov 11 2025
Event: Cray User Group, CUG 2025 - Jersey City, United States
Duration: May 4 2025 to May 8 2025

Publication series

Name: Proceedings of CUG 2025 - Cray User Group Conference

Conference

Conference: Cray User Group, CUG 2025
Country/Territory: United States
City: Jersey City
Period: 05/4/25 to 05/8/25

Funding

This research used resources from the Oak Ridge Leadership Computing Facility, which is a US Department of Energy (DOE) Office of Science user facility supported under contract DE-AC05-00OR22725. This work is also supported by the DOE Office of Science, Advanced Scientific Computing Research, Express project "Leveraging OpenSource Simulators to Enable hw/sw Co-design of Next-generation HPC Systems" (DE-FOA-0002950).

Keywords

  • AMD GPUs
  • Exascale computing systems
  • HPC applications
  • Performance Tools
  • Power and energy measurements
