Towards Diverse and Representative Global Pretraining Datasets for Remote Sensing Foundation Models

Research output: Contribution to conference › Paper › peer-review

Abstract

The design of a pretraining dataset is emerging as a critical component for the generality of foundation models. In the remote sensing realm, large volumes of imagery and benchmark datasets exist that can be leveraged to pretrain foundation models; however, using this imagery in the absence of a well-crafted sampling strategy is inefficient and risks producing biased, less generalizable models. Here, we provide a discussion and vision for the curation and assessment of pretraining datasets for remote sensing geospatial foundation models. We highlight the importance of geographic, temporal, and image acquisition diversity and review possible strategies to enable such diversity at global scale. In addition to these characteristics, support within the dataset for various spatial-temporal pretext tasks is also critical. Ultimately, our primary objective is to draw attention to the data curation stage of the foundation model development pipeline. By doing so, we believe it is possible to reduce the biases of geospatial foundation models and enable broader generalization to downstream remote sensing tasks and applications.
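The record does not specify the sampling strategies the abstract alludes to. As a minimal sketch of the geographic-diversity idea only, one could cap the number of tiles drawn from each coarse latitude/longitude grid cell so that densely imaged regions do not dominate the pretraining set; the function name, grid size, and per-cell quota below are all illustrative assumptions, not the paper's method:

```python
import random
from collections import defaultdict

def stratified_sample(tiles, cell_size_deg=10.0, per_cell=5, seed=0):
    """Sample imagery tiles evenly across coarse geographic grid cells.

    `tiles` is assumed to be a list of dicts with 'lat' and 'lon' keys;
    the grid resolution and per-cell quota are illustrative choices.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for t in tiles:
        # Bucket each tile into a cell_size_deg x cell_size_deg grid cell.
        key = (int(t["lat"] // cell_size_deg), int(t["lon"] // cell_size_deg))
        cells[key].append(t)
    sample = []
    for members in cells.values():
        rng.shuffle(members)
        # Cap per cell so densely imaged regions cannot dominate the sample.
        sample.extend(members[:per_cell])
    return sample
```

For example, given 100 tiles clustered in one grid cell and 3 in another, this returns at most 5 from the dense cell plus all 3 from the sparse one, flattening the geographic imbalance.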

Original language: English
Pages: 2723-2728
Number of pages: 6
State: Published - 2024
Event: 2024 IEEE International Geoscience and Remote Sensing Symposium, IGARSS 2024 - Athens, Greece
Duration: Jul 7, 2024 – Jul 12, 2024

Conference

Conference: 2024 IEEE International Geoscience and Remote Sensing Symposium, IGARSS 2024
Country/Territory: Greece
City: Athens
Period: 07/07/24 – 07/12/24

Funding

We acknowledge that this manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doepublic-access-plan).

Funders:
- U.S. Department of Energy (Contract No. DE-AC05-00OR22725)
- United States Government
- DOE Public Access Plan

Keywords

- datasets
- foundation models
- pretraining
- self-supervised learning
- unsupervised learning
