Data Readiness for Scientific AI at Scale

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains (climate, nuclear fusion, bio/health, and materials) to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high-performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.

Original language: English
Title of host publication: 54th International Conference on Parallel Processing, ICPP 2025 - Workshops Proceedings
Publisher: Association for Computing Machinery, Inc
Pages: 18-24
Number of pages: 7
ISBN (Electronic): 9798400721090
State: Published - Dec 20 2025
Event: 54th International Conference on Parallel Processing Workshop, ICPP 2025 - San Diego, United States
Duration: Sep 8 2025 → Sep 11 2025

Publication series

Name: 54th International Conference on Parallel Processing, ICPP 2025 - Workshops Proceedings

Conference

Conference: 54th International Conference on Parallel Processing Workshop, ICPP 2025
Country/Territory: United States
City: San Diego
Period: 09/8/25 → 09/11/25

Funding

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported under the Advanced Scientific Computing Research programs in the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. We would also like to thank Max Lupo Pasini, Jens Glaser, Yashika Ghai, Fernanda Foertter, John Gounley, and Heidi Hanson for helpful information regarding AI readiness challenges for specific domains. Finally, OpenAI's ChatGPT was used to provide editing suggestions for several sentences in the paper, as well as to help analyze some of the preprocessing patterns, given large chunks of code.

Keywords

  • AI-readiness
  • bioinformatics
  • climate science
  • data preprocessing
  • fusion research
  • high-performance computing
  • materials science
  • scientific datasets
