Abstract
This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains (climate, nuclear fusion, bio/health, and materials) to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high-performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.
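As a rough illustration of the two-dimensional maturity matrix described above, the sketch below models the two axes as enumerations and the matrix as a grid of cells. Only the endpoints of each axis (raw, AI-ready, ingest, shard) come from the abstract; the intermediate members `CLEANED` and `TRANSFORM`, and the helper names `empty_matrix`, `mark`, and `readiness`, are hypothetical placeholders and should not be read as the paper's definitions.

```python
from enum import IntEnum


class ReadinessLevel(IntEnum):
    """Data Readiness Levels; only RAW and AI_READY are named in the abstract."""
    RAW = 0
    CLEANED = 1      # hypothetical intermediate level, for illustration only
    AI_READY = 2


class ProcessingStage(IntEnum):
    """Data Processing Stages; only INGEST and SHARD are named in the abstract."""
    INGEST = 0
    TRANSFORM = 1    # hypothetical intermediate stage, for illustration only
    SHARD = 2


def empty_matrix():
    """Return a stage-by-level grid with every cell unmarked."""
    return {(s, l): False for s in ProcessingStage for l in ReadinessLevel}


def mark(matrix, stage, level):
    """Record that a dataset has reached `level` (and all lower levels) at `stage`."""
    for l in ReadinessLevel:
        if l <= level:
            matrix[(stage, l)] = True


def readiness(matrix, stage):
    """Return the highest level marked at `stage`, or None if nothing is marked."""
    reached = [l for l in ReadinessLevel if matrix[(stage, l)]]
    return max(reached) if reached else None
```

A dataset could then be tracked by calling `mark(m, ProcessingStage.INGEST, ReadinessLevel.AI_READY)` on a grid `m` from `empty_matrix()` and querying `readiness(m, stage)` for each stage. Whether levels accumulate this way within a stage is an assumption of this sketch.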
| Original language | English |
|---|---|
| Title of host publication | 54th International Conference on Parallel Processing, ICPP 2025 - Workshops Proceedings |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 18-24 |
| Number of pages | 7 |
| ISBN (Electronic) | 9798400721090 |
| State | Published - Dec 20 2025 |
| Event | 54th International Conference on Parallel Processing Workshop, ICPP 2025 - San Diego, United States. Duration: Sep 8 2025 → Sep 11 2025 |
Publication series
| Name | 54th International Conference on Parallel Processing, ICPP 2025 - Workshops Proceedings |
|---|---|
Conference
| Conference | 54th International Conference on Parallel Processing Workshop, ICPP 2025 |
|---|---|
| Country/Territory | United States |
| City | San Diego |
| Period | 09/08/25 → 09/11/25 |
Funding
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported under the Advanced Scientific Computing Research programs in the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. We would also like to thank Max Lupo Pasini, Jens Glaser, Yashika Ghai, Fernanda Foertter, John Gounley, and Heidi Hanson for helpful information regarding AI readiness challenges for specific domains. Finally, OpenAI's ChatGPT was used to suggest edits to several sentences in the paper and, given large chunks of code, to help analyze some of the preprocessing patterns.
Keywords
- AI-readiness
- bioinformatics
- climate science
- data preprocessing
- fusion research
- high-performance computing
- materials science
- scientific datasets
Fingerprint
Dive into the research topics of 'Data Readiness for Scientific AI at Scale'. Together they form a unique fingerprint.