Preliminary Study on Fine-Grained Power and Energy Measurements on Grace Hopper GH200 with Open-Source Performance Tools

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citations

Abstract

The increasing adoption of tightly integrated, heterogeneous architectures, combined with the slowdown of Moore’s law, has made application power and energy-driven optimizations critical to efficiently use high-performance computing systems. This paper introduces a newly developed open-source toolkit that seamlessly integrates the Linux real-time hardware monitoring program hwmon with the Performance Application Programming Interface and the Score-P performance measurement system, thereby enabling fine-grained power and energy measurements for high-performance computing applications. Our primary target platform is the Wombat test bed, which is a system based on the NVIDIA GH200 superchip. The toolkit can capture transient power peaks with high temporal resolution (50 ms) and, thanks to Score-P integration, can map power metrics to specific code regions, thereby providing actionable information on power-intensive operations and inefficiencies. The toolkit also provides a holistic view of both the power and the energy consumption of the entire GH200 superchip by covering all major components: the Grace CPU, the Hopper GPU, and the I/O subsystem. Experiments that use Locally Self-consistent Multiple Scattering, which is an application for first-principles calculations of materials developed at Oak Ridge National Laboratory, have demonstrated the tool’s ability to identify transient power spikes and uncover opportunities for energy-aware optimizations. Additionally, we introduce a Python-based utility for converting Open Trace Format 2 traces to Parquet format, thus enabling advanced data analysis for numerical integration methods applied to power data for accurate energy profiling.
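The energy-profiling step mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes the power samples are already available as parallel lists of timestamps (seconds) and power readings (watts), and it uses the trapezoidal rule as one possible numerical integration method; the abstract does not say which methods the authors apply. The function name `energy_joules` and the sample values are hypothetical and are not taken from the toolkit.

```python
def energy_joules(timestamps_s, power_w):
    """Estimate energy (J) from timestamped power samples (W) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    energy = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive samples.
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy


# Hypothetical samples at the 50 ms resolution quoted in the abstract,
# with one transient spike to 300 W in the middle of the window.
timestamps = [0.00, 0.05, 0.10, 0.15, 0.20]
power = [100.0, 100.0, 300.0, 100.0, 100.0]

print(f"Estimated energy: {energy_joules(timestamps, power):.2f} J")
```

For these made-up samples the estimate is about 30 J, compared with 20 J for a flat 100 W baseline over the same 0.2 s window, which shows how a short spike that coarser sampling could miss still contributes to the integrated energy.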

Original language: English
Title of host publication: Proceedings of International Conference on High Performance Computing in Asia-Pacific Region Workshops, HPC Asia 2025 Workshops
Publisher: Association for Computing Machinery, Inc
Pages: 11-22
Number of pages: 12
ISBN (Electronic): 9798400713422
State: Published - Apr 19 2025
Event: 2025 International Conference on High Performance Computing in the Asia-Pacific Region, HPC Asia 2025 - Hsinchu, Taiwan, Province of China
Duration: Feb 19 2025 to Feb 21 2025

Publication series

Name: Proceedings of International Conference on High Performance Computing in Asia-Pacific Region Workshops, HPC Asia 2025 Workshops

Conference

Conference: 2025 International Conference on High Performance Computing in the Asia-Pacific Region, HPC Asia 2025
Country/Territory: Taiwan, Province of China
City: Hsinchu
Period: 02/19/25 to 02/21/25

Funding

This research used resources from the Oak Ridge Leadership Computing Facility, which is a US Department of Energy (DOE) Office of Science user facility supported under contract DE-AC05-00OR22725. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under contract number DE-AC05-00OR22725.

Keywords

  • HPC applications
  • LSMS application
  • NVIDIA GPUs
  • Power measurements
  • performance tools
