TY - GEN
T1 - A Digital Twin Framework for Liquid-cooled Supercomputers as Demonstrated at Exascale
AU - Brewer, Wesley
AU - Maiterth, Matthias
AU - Kumar, Vineet
AU - Wojda, Rafal
AU - Bouknight, Sedrick
AU - Hines, Jesse
AU - Shin, Woong
AU - Greenwood, Scott
AU - Grant, David
AU - Williams, Wesley
AU - Wang, Feiyi
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - We present ExaDigiT, an open-source framework for developing comprehensive digital twins of liquid-cooled supercomputers. It integrates three main modules: (1) a resource allocator and power simulator, (2) a transient thermo-fluidic cooling model, and (3) an augmented reality model of the supercomputer and central energy plant. The framework enables the study of 'what-if' scenarios, system optimizations, and virtual prototyping of future systems. Using Frontier as a case study, we demonstrate the framework's capabilities by replaying six months of system telemetry for systematic verification and validation. Such a comprehensive analysis of a liquid-cooled exascale supercomputer is the first of its kind. ExaDigiT elucidates complex transient cooling system dynamics, runs synthetic or real workloads, and predicts energy losses due to rectification and voltage conversion. Throughout our paper, we present lessons learned to benefit HPC practitioners developing similar digital twins. We envision the digital twin will be a key enabler for sustainable, energy-efficient supercomputing.
KW - augmented reality
KW - data center power
KW - digital twins
KW - electronics cooling
KW - energy efficiency
KW - exascale computing
UR - http://www.scopus.com/inward/record.url?scp=85214926506&partnerID=8YFLogxK
U2 - 10.1109/SC41406.2024.00029
DO - 10.1109/SC41406.2024.00029
M3 - Conference contribution
AN - SCOPUS:85214926506
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2024
PB - IEEE Computer Society
T2 - 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024
Y2 - 17 November 2024 through 22 November 2024
ER -