Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

With the end of Moore’s law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can, in many cases, improve model accuracy and substantially lower energy consumption, with observed reductions of up to 38×.

Original languageEnglish
Title of host publicationProceedings of 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops
PublisherAssociation for Computing Machinery, Inc
Pages1-10
Number of pages10
ISBN (Electronic)9798400718717
DOIs
StatePublished - Nov 15 2025
Event2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops - St. Louis, United States
Duration: Nov 16 2025Nov 21 2025

Publication series

NameProceedings of 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops

Conference

Conference2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops
Country/TerritoryUnited States
CitySt. Louis
Period11/16/2511/21/25

Funding

This research was sponsored by and used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility at the Oak Ridge National Laboratory (ORNL) supported by the U.S. Department of Energy under Contract No. DEAC05-00OR22725. This research was also sponsored by the Artificial Intelligence Initiative as part of the Laboratory Directed Research and Development (LDRD) Program of ORNL. We would also like to thank a number of colleagues for helpful information and advice along the way, including: Malik Hassanaly for help with UIPS, as well as Shantenu Jha, Joel Bretheim, Andrew Shao, Trey White, and Bronson Messer for providing feedback on the manuscript. The authors acknowledge the use of ChatGPT (version 4.5, OpenAI) for language editing, LaTeX formatting assistance, and structural suggestions.

Keywords

  • energy efficiency
  • maximum entropy
  • phase-space selection
  • scientific foundation models
  • subsampling

Fingerprint

Dive into the research topics of 'Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training'. Together they form a unique fingerprint.

Cite this