A Digital Twin Framework for Liquid-cooled Supercomputers as Demonstrated at Exascale

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

6 Scopus citations

Abstract

We present ExaDigiT, an open-source framework for developing comprehensive digital twins of liquid-cooled supercomputers. It integrates three main modules: (1) a resource allocator and power simulator, (2) a transient thermo-fluidic cooling model, and (3) an augmented reality model of the supercomputer and central energy plant. The framework enables the study of 'what-if' scenarios, system optimizations, and virtual prototyping of future systems. Using Frontier as a case study, we demonstrate the framework's capabilities by replaying six months of system telemetry for systematic verification and validation. Such a comprehensive analysis of a liquid-cooled exascale supercomputer is the first of its kind. ExaDigiT elucidates complex transient cooling system dynamics, runs synthetic or real workloads, and predicts energy losses due to rectification and voltage conversion. Throughout our paper, we present lessons learned to benefit HPC practitioners developing similar digital twins. We envision the digital twin will be a key enabler for sustainable, energy-efficient supercomputing.

Original language: English
Title of host publication: Proceedings of SC 2024
Subtitle of host publication: International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Electronic): 9798350352917
State: Published - 2024
Event: 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024 - Atlanta, United States
Duration: Nov 17 2024 to Nov 22 2024

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024
Country/Territory: United States
City: Atlanta
Period: 11/17/24 to 11/22/24

Funding

This research was sponsored by and used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility at the Oak Ridge National Laboratory (ORNL) supported by the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. We thank the technical staff at ORNL, without whom this work would not have been possible, including Scott Atchley, Matt Sieger, Chris Zimmer, Paul Abston, Jim Rogers, Matt Ezell, Robert Gillen, Dane de Wet, Kazi Asifuzzaman, Nathan Parkison, Cory Spradlin, Brian Reagan, John Holmen, Amir Shehata, Nick Hagerty, Seung-Hwan Lim, and Ahmad Maroof Karimi. From HPE, we would like to thank Cullen Bash, Tim Dykes, Matt Slaby, and Justin Queen, who provided us with invaluable help along the way. Thanks to Jake Webb from Cadre5, LLC for his significant contributions to the dashboard development. Moreover, we want to acknowledge our growing ExaDigiT open source community for their enthusiastic support and engagement, which has been a significant source of inspiration and motivation for this work. Finally, ChatGPT was utilized for converting Python code into pseudocode in Algorithm 1, grammar enhancements, and assisting with table formatting.

Keywords

  • augmented reality
  • data center power
  • digital twins
  • electronics cooling
  • energy efficiency
  • exascale computing
