SPARSE L1-AUTOENCODERS FOR SCIENTIFIC DATA COMPRESSION

Research output: Contribution to journal › Article › peer-review

Abstract

Scientific datasets present unique challenges for machine learning–driven compression methods, including more stringent requirements on accuracy and mitigation of potential invalidating artifacts. Drawing on results from compressed sensing and rate-distortion theory, we introduce effective data compression methods by developing autoencoders using high-dimensional latent spaces that are L1 regularized to obtain sparse low-dimensional representations. We show how these information-rich latent spaces can be used to mitigate blurring and other artifacts to obtain highly effective data compression methods for scientific data. We demonstrate our methods for small-angle scattering (SAS) datasets, showing they can achieve compression ratios around two orders of magnitude and in some cases better. Our compression methods show promise for use in addressing current bottlenecks in transmission, storage, and analysis in high-performance distributed computing environments. This is central to processing the large volume of scientific data, for instance, for SAS data being generated at shared experimental facilities around the world to support scientific investigations. Our approaches provide general ways for obtaining specialized compression methods for targeted scientific datasets and are not limited to specific applications.
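As a rough illustration of the idea in the abstract, the following minimal sketch shows an L1-penalized autoencoder objective and how a sparse latent code could translate into a compression ratio. It is not the authors' architecture or training procedure: the linear encoder and decoder, the soft-thresholding step, the (index, value) storage model, and all names (`W_enc`, `W_dec`, `lam`, `tau`) are assumptions made for this example only.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||z||_1: shrinks entries toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def l1_autoencoder_loss(x, W_enc, W_dec, lam):
    """Reconstruction error plus an L1 penalty on the latent code.

    Assumption: linear encoder/decoder, chosen only to keep the sketch short.
    """
    z = x @ W_enc                       # latent code, shape (batch, latent_dim)
    x_hat = z @ W_dec                   # reconstruction, shape (batch, input_dim)
    recon = 0.5 * np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = lam * np.mean(np.sum(np.abs(z), axis=1))
    return recon + sparsity, z

def compression_ratio(n_input, nnz):
    """Ratio of input size to stored size.

    Assumption: each nonzero latent entry is stored as an (index, value)
    pair, so a code with nnz nonzeros costs 2 * nnz numbers.
    """
    return n_input / (2 * max(nnz, 1))

# Small demonstration on random data (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))            # 4 samples, 64 values each
W_enc = rng.normal(size=(64, 256)) / 8  # high-dimensional latent space
W_dec = rng.normal(size=(256, 64)) / 16
loss, z = l1_autoencoder_loss(x, W_enc, W_dec, lam=0.1)
z_sparse = soft_threshold(z, tau=1.5)   # zero out small latent entries
nnz_per_sample = np.count_nonzero(z_sparse, axis=1)
```

Under the storage model assumed here, a 1000-value input represented by 5 nonzero latent entries would give a ratio of 100, which is the "two orders of magnitude" regime the abstract refers to.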

Original language: English
Pages (from-to): 51-71
Number of pages: 21
Journal: Journal of Machine Learning for Modeling and Computing
Volume: 6
Issue number: 4
State: Published - 2025

Funding

This work was partially supported by the National Science Foundation (NSF) under Grant DMS-2152661 (M. Chung) and Grant DMS-2306101 (P. Atzberger). This work was partially supported by UT-Battelle, LLC, under contract DE-AC05-00OR22725, with the US Department of Energy (DOE), Office of Advanced Scientific Computing Research (R. Archibald). The publisher, by accepting the work for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the submitted manuscript version of this work, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Keywords

  • autoencoders
  • compression
  • sparsity
