Abstract
Checkpoint/Restart (C/R) strategies are vital for fault tolerance in scientific simulations based on partial differential equations (PDEs), yet traditional checkpointing incurs significant I/O overhead. Lossy compression offers a scalable solution by reducing checkpoint data size, but conventional methods often lack control over physical invariants (e.g., energy), leading to instability such as oscillations or divergence in PDE systems. This paper introduces a stability-preserving compression approach tailored for PDE simulations that explicitly controls kinetic and potential energy perturbations to ensure stable restarts. Extensive experiments across diverse PDE configurations demonstrate that our method maintains numerical stability with minimal error magnification, even across multiple checkpoint-restart cycles, outperforming state-of-the-art lossy compressors. Parallel evaluations on the Frontier supercomputer show up to 8.4× improvement in checkpoint write performance and 6.3× in read performance, while maintaining relative L2 errors of ∼2e-6 throughout continued simulation. These results provide practical guidance for balancing compression accuracy, stability, and computational efficiency in large-scale PDE applications.
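To make the notion of an energy perturbation concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a 1D periodic wave-equation state (displacement `u`, velocity `v`), a discrete energy built from a forward difference, and plain uniform quantization as a stand-in for an error-bounded lossy compressor. All function names and parameters here are hypothetical and chosen only for illustration.

```python
import math

def kinetic_energy(v, dx):
    # Discrete kinetic energy: 0.5 * sum(v_i^2) * dx
    return 0.5 * dx * sum(x * x for x in v)

def potential_energy(u, dx, c=1.0):
    # Discrete potential energy on a periodic grid, using forward differences
    n = len(u)
    return 0.5 * c * c * dx * sum(
        ((u[(i + 1) % n] - u[i]) / dx) ** 2 for i in range(n)
    )

def quantize(values, eb):
    # Uniform quantization with bin width 2*eb (assumed stand-in for a
    # lossy compressor): pointwise absolute error is at most eb
    step = 2.0 * eb
    return [round(x / step) * step for x in values]

def relative_l2(ref, approx):
    # Relative L2 error of approx with respect to ref
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, approx)))
    den = math.sqrt(sum(a * a for a in ref))
    return num / den

# Hypothetical checkpoint state: one sine mode on [0, 2*pi)
n = 256
dx = 2.0 * math.pi / n
u = [math.sin(i * dx) for i in range(n)]
v = [math.cos(i * dx) for i in range(n)]

eb = 1e-4
u_q, v_q = quantize(u, eb), quantize(v, eb)

e0 = kinetic_energy(v, dx) + potential_energy(u, dx)
e1 = kinetic_energy(v_q, dx) + potential_energy(u_q, dx)

rel_err = relative_l2(u, u_q)
rel_energy_change = abs(e1 - e0) / e0
print(f"relative L2 error of u:   {rel_err:.3e}")
print(f"relative energy change:   {rel_energy_change:.3e}")
```

In this sketch the pointwise bound `eb` limits the relative L2 error directly, but the potential energy depends on differences divided by `dx`, so the same bound constrains the energy only loosely. That gap is the kind of effect a purely pointwise error bound leaves uncontrolled.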
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025 |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 1992-2005 |
| Number of pages | 14 |
| ISBN (Electronic) | 9798400714665 |
| State | Published - Nov 15 2025 |
| Event | 2025 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025 - St. Louis, United States; Duration: Nov 16 2025 → Nov 21 2025 |
Publication series
| Name | Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025 |
|---|---|
Conference
| Conference | 2025 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025 |
|---|---|
| Country/Territory | United States |
| City | St. Louis |
| Period | 11/16/25 → 11/21/25 |
Funding
This research is supported in part by the U.S. Department of Energy (DOE) RAPIDS-2 SciDAC and Sirius2 projects under contract number DE-AC05-00OR22725, and by the National Science Foundation (NSF) under grants DMS-2324364, OAC-2313122, OAC-2311756, OAC-2311757, and OAC-2144403. This research used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility.
Keywords
- Checkpoint-restart
- large-scale PDEs
- lossy compression
- stability preservation