Abstract
The increasing complexity and the power and energy demands of heterogeneous exascale systems, such as the Frontier supercomputer, make it difficult to measure and optimize the power consumption of applications. Current tools lack the resolution to capture fine-grained power and energy measurements, fail to validate in-band measurements against out-of-band power sensors, or cannot integrate this information with application performance events in a scalable manner. This paper introduces a novel open-source performance toolkit that integrates extended PAPI components with Score-P plugins to enable in-band, fine-grained power and energy measurements, while also supporting validation against power meter measurements for both CPUs and GPUs. One key contribution is the ability to perform millisecond-level power and energy measurements for AMD MI250X GPUs and to map them to application performance events within a single, scalable trace and measurement system. Our toolkit combines coarse-grained measurements from cray_pm counters with high-resolution metrics from rocm_smi and RAPL, converting GPU accumulated-energy readings into power to capture both transient and steady-state power behavior, which out-of-band and monitoring tools often miss. By mapping these metrics to specific application regions, developers can identify energy hotspots, address inefficiencies in GPU kernel execution, and validate in-band measurements against external measurements. We demonstrate the effectiveness of this approach through case studies using benchmarks such as GPU rocblas_sgemm, BLIS c_blas_dgemm, and rocHPL, highlighting the variability of the measurements and the impact of transient power spikes on kernel-level efficiency.
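The conversion of accumulated energy into power mentioned in the abstract can be illustrated with a minimal sketch. It is not the toolkit's implementation: the function name `energy_to_power`, the sample layout, and the example values are assumptions made for illustration, and no rocm_smi or PAPI call is used. The idea is simply that average power over an interval is the energy difference divided by the elapsed time, P = ΔE / Δt.

```python
from typing import List, Tuple


def energy_to_power(samples: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Convert accumulated-energy samples into average power per interval.

    samples: (timestamp in seconds, accumulated energy in joules), sorted by time.
    Returns a list of (interval midpoint in seconds, average power in watts).
    This is a hypothetical helper, not part of any real library.
    """
    power = []
    for (t0, e0), (t1, e1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            # Skip duplicate or out-of-order timestamps instead of dividing by zero.
            continue
        power.append(((t0 + t1) / 2, (e1 - e0) / dt))
    return power


if __name__ == "__main__":
    # Illustrative values only (not measured data): samples taken 1 ms apart.
    readings = [(0.000, 10.00), (0.001, 10.30), (0.002, 10.85), (0.003, 11.15)]
    for midpoint, watts in energy_to_power(readings):
        print(f"t={midpoint * 1e3:.1f} ms  P={watts:.0f} W")
```

With short sampling intervals, a brief jump in the per-interval values shows up as a transient spike, whereas a coarse counter read once per second would average it away. A real energy counter may also wrap around or reset, which this sketch does not handle.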
| Original language | English |
|---|---|
| Title of host publication | Proceedings of CUG 2025 - Cray User Group Conference |
| Editors | Ashley Barker, Bilel Hadri, Colleen Bertoni, Nick Hagerty, Timothy W. Robinson |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 135-146 |
| Number of pages | 12 |
| ISBN (Electronic) | 9798400713279 |
| State | Published - Nov 11 2025 |
| Event | Cray User Group, CUG 2025 - Jersey City, United States; Duration: May 4 2025 → May 8 2025 |
Publication series
| Name | Proceedings of CUG 2025 - Cray User Group Conference |
|---|---|
Conference
| Conference | Cray User Group, CUG 2025 |
|---|---|
| Country/Territory | United States |
| City | Jersey City |
| Period | 05/4/25 → 05/8/25 |
Funding
This research used resources from the Oak Ridge Leadership Computing Facility, which is a US Department of Energy (DOE) Office of Science user facility supported under contract DE-AC05-00OR22725. This work is also supported by the DOE Office of Science, Advanced Scientific Computing Research, Express project "Leveraging OpenSource Simulators to Enable hw/sw Co-design of Next-generation HPC Systems" (DE-FOA-0002950).
Keywords
- AMD GPUs
- Exascale computing systems
- HPC applications
- Performance tools
- Power and energy measurements