Experiences Integrating Database Support into the OLCF Test Harness

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In 2005, the United States Department of Energy created the Oak Ridge Leadership Computing Facility (OLCF) to provide researchers across academia, industry, and government with leadership computing resources up to 100 times more powerful than those available at the time. These leadership-class supercomputers are among the largest in the world and often diverge from conventional supercomputer architectures at the time of deployment, demanding flexible and thorough testing to ensure their functionality and performance. To support the required testing, OLCF developed the OLCF Test Harness (OTH) [7, 8]. The OTH has adapted through more than 20 years of novel architectures, including the arrivals of NVIDIA GPUs in Titan and AMD GPUs in Frontier. Unique challenges with each system provide opportunities for continuously improving the OTH. In the case of Frontier, OLCF’s latest leadership-class supercomputer, deployed in 2022, one unique challenge was how to ensure the functionality of its more than 9,400 compute nodes. The OTH recently released version 3.0 [2], which implements support for logging test data to InfluxDB, and version 3.1, which adds support for Apache Kafka with the Apache Druid database. Grafana interfaces to these databases greatly improve the real-time monitoring and broader analysis capabilities for Frontier. The OTH is just one of many system testing frameworks available, but it is among the first to explicitly enable and encourage database-based logging. Among three well-known testing frameworks surveyed (ReFrame, Ramble, and Pavilion2), ReFrame is the only one that supports any form of database and documents features that could be extended for logging to other databases or formats. As such, we see an opportunity to openly discuss our experiences and lessons learned integrating database support into the OTH.
In this work, we present a high-level description of the OTH and the database support within, and discuss the challenges, successes, failures, and future goals for the OTH.

Original language: English
Title of host publication: Proceedings of 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops
Publisher: Association for Computing Machinery, Inc
Pages: 662-668
Number of pages: 7
ISBN (Electronic): 9798400718717
State: Published - Nov 15 2025
Event: 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops - St. Louis, United States
Duration: Nov 16 2025 to Nov 21 2025

Publication series

Name: Proceedings of 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops

Conference

Conference: 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops
Country/Territory: United States
City: St. Louis
Period: 11/16/25 to 11/21/25

Funding

This work would not be possible without the original authors and contributors of the OLCF Test Harness: Michael Brim, Reuben Budiardja, Wayne Joubert, Arnold Tharrington, and Verónica Melesse Vergara. Additional thanks to Verónica for advice while refining this writing. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Keywords

  • data analysis
  • system testing
  • test monitoring
  • testing framework
