Abstract
The increasing adoption of tightly integrated, heterogeneous architectures, combined with the slowdown of Moore’s law, has made power- and energy-driven application optimizations critical to the efficient use of high-performance computing systems. This paper introduces a newly developed open-source toolkit that integrates the Linux real-time hardware monitoring program hwmon with the Performance Application Programming Interface (PAPI) and the Score-P performance measurement system, thereby enabling fine-grained power and energy measurements for high-performance computing applications. Our primary target platform is the Wombat test bed, a system based on the NVIDIA GH200 superchip. The toolkit captures transient power peaks at high temporal resolution (50 ms) and, through its Score-P integration, maps power metrics to specific code regions, thereby providing actionable information on power-intensive operations and inefficiencies. The toolkit also provides a holistic view of the power and energy consumption of the entire GH200 superchip by covering all major components: the Grace CPU, the Hopper GPU, and the I/O subsystem. Experiments with Locally Self-consistent Multiple Scattering (LSMS), an application for first-principles calculations of materials developed at Oak Ridge National Laboratory, demonstrate the tool’s ability to identify transient power spikes and uncover opportunities for energy-aware optimizations. Additionally, we introduce a Python-based utility that converts Open Trace Format 2 traces to Parquet format, thus enabling advanced data analysis, including numerical integration of power data for accurate energy profiling.
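The abstract mentions numerical integration of power data for energy profiling. As a rough illustration only, the sketch below applies the trapezoidal rule to a series of timestamped power samples. The function name `energy_joules` and the sample values are hypothetical and are not taken from the paper's toolkit; the paper may use a different integration method.

```python
from typing import Sequence


def energy_joules(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power samples (watts) over time (seconds) with the
    trapezoidal rule and return the energy in joules.

    Hypothetical helper for illustration; not part of the paper's toolkit.
    """
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    energy = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive samples.
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy


# Made-up samples at a 50 ms interval with one transient spike.
times = [0.00, 0.05, 0.10, 0.15, 0.20]
power = [100.0, 100.0, 300.0, 100.0, 100.0]
total = energy_joules(times, power)
print(f"{total:.2f} J")
```

With a 50 ms sampling interval, a short spike such as the one at 0.10 s contributes visibly to the integral; a coarser interval could miss it entirely, which is one reason high temporal resolution matters for energy estimates.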
| Original language | English |
|---|---|
| Title of host publication | Proceedings of International Conference on High Performance Computing in Asia-Pacific Region Workshops, HPC Asia 2025 Workshops |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 11-22 |
| Number of pages | 12 |
| ISBN (Electronic) | 9798400713422 |
| State | Published - Apr 19 2025 |
| Event | 2025 International Conference on High Performance Computing in the Asia-Pacific Region, HPC Asia 2025 - Hsinchu, Taiwan, Province of China; Duration: Feb 19 2025 → Feb 21 2025 |
Publication series
| Name | Proceedings of International Conference on High Performance Computing in Asia-Pacific Region Workshops, HPC Asia 2025 Workshops |
|---|---|
Conference
| Conference | 2025 International Conference on High Performance Computing in the Asia-Pacific Region, HPC Asia 2025 |
|---|---|
| Country/Territory | Taiwan, Province of China |
| City | Hsinchu |
| Period | 02/19/25 → 02/21/25 |
Funding
This research used resources from the Oak Ridge Leadership Computing Facility, which is a US Department of Energy (DOE) Office of Science user facility supported under contract DE-AC05-00OR22725. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under contract number DE-AC05-00OR22725.
Keywords
- HPC applications
- LSMS application
- NVIDIA GPUs
- Power measurements
- Performance tools