Abstract
We introduce CAESAR, a new framework for scientific data reduction that stands for Conditional AutoEncoder with Super-resolution for Augmented Reduction. The baseline model, CAESAR-V, is built on a standard variational autoencoder with scale hyperpriors and super-resolution modules to achieve high compression. It encodes data into a latent space and uses learned priors for compact, information-rich representations. The enhanced version, CAESAR-D, begins by compressing keyframes using an autoencoder and extends the architecture by incorporating conditional diffusion to interpolate the latent spaces of missing frames between keyframes. This enables high-fidelity reconstruction of intermediate data without requiring their explicit storage. By distinguishing CAESAR-V (variational) from CAESAR-D (diffusion-enhanced), we offer a modular family of solutions that balance compression efficiency, reconstruction accuracy, and computational cost for scientific data workflows. Additionally, we develop a GPU-accelerated postprocessing module which enforces error bounds on the reconstructed data, achieving real-time compression while maintaining rigorous accuracy guarantees. Experimental results across multiple scientific datasets demonstrate that our framework achieves up to 10× higher compression ratios compared to rule-based compressors such as SZ3. This work provides a scalable, domain-adaptive solution for efficient storage and transmission of large-scale scientific simulation data.
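The error-bound postprocessing step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' GPU module, whose details the abstract does not give. It assumes one common approach: wherever the learned reconstruction deviates from the original by more than a pointwise bound `eps`, store an integer code for the residual quantized in steps of `2 * eps`, so that the corrected value falls back within the bound. The function names `compute_corrections` and `apply_corrections` and the sample values are hypothetical.

```python
def compute_corrections(original, reconstructed, eps):
    """Return {index: integer code} for points whose error exceeds eps.

    Assumption: residuals are quantized in steps of 2 * eps, so the
    corrected value lands within eps of the original (up to float rounding).
    """
    step = 2.0 * eps
    corrections = {}
    for i, (x, x_hat) in enumerate(zip(original, reconstructed)):
        residual = x - x_hat
        if abs(residual) > eps:
            corrections[i] = round(residual / step)
    return corrections


def apply_corrections(reconstructed, corrections, eps):
    """Decoder side: add the dequantized residuals back to the reconstruction."""
    step = 2.0 * eps
    out = list(reconstructed)
    for i, code in corrections.items():
        out[i] += code * step
    return out


# Toy data standing in for a simulation field and its learned reconstruction.
original = [1.00, 2.00, 3.00, 4.00]
reconstructed = [1.01, 2.25, 2.95, 3.45]
eps = 0.1

corrections = compute_corrections(original, reconstructed, eps)
bounded = apply_corrections(reconstructed, corrections, eps)
max_error = max(abs(x - y) for x, y in zip(original, bounded))
```

In a scheme like this, only the out-of-bound points carry extra storage, so the cost of the guarantee grows with how often the learned reconstruction misses the bound.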
| Field | Value |
|---|---|
| Original language | English |
| Article number | 8977 |
| Journal | Applied Sciences (Switzerland) |
| Volume | 15 |
| Issue number | 16 |
| State | Published - Aug 2025 |
Funding
This research was funded by the U.S. Department of Energy under Grant Nos. DE-SC0021320 and DE-SC0022265.
Keywords
- error bound guarantees
- foundation model
- generative AI
- machine learning
- scientific data reduction