Klimatic: A virtual data lake for harvesting and distribution of geospatial data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 Scopus citations

Abstract

Many interesting geospatial datasets are publicly accessible on web sites and other online repositories. However, the sheer number of datasets and locations, plus a lack of support for cross-repository search, makes it difficult for researchers to discover and integrate relevant data. We describe here early results from a system, Klimatic, that aims to overcome these barriers to discovery and use by automating the tasks of crawling, indexing, integrating, and distributing geospatial data. Klimatic implements a scalable crawling and processing architecture that uses an elastic container-based model to locate and retrieve relevant datasets and to extract metadata from headers and within files to build a global index of known geospatial data. In so doing, we create an expansive geospatial virtual data lake that records the location, formats, and other characteristics of large numbers of geospatial datasets while also caching popular data subsets for rapid access. A flexible query interface allows users to request data that satisfy supplied type, spatial, temporal, and provider specifications; in processing such queries, the system uses interpolation and aggregation to combine data of different types, data formats, resolutions, and bounds. Klimatic has so far incorporated more than 10,000 datasets from over 120 sources and has been demonstrated to scale well with data size and query complexity.
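The query step described in the abstract can be illustrated with a minimal sketch. It is not Klimatic's implementation: the record fields, the function names `query` and `block_mean`, and the example.org URLs are all hypothetical, chosen only to show how an index of dataset metadata might be filtered by type, spatial bounds, time range, and provider, and how a finer grid might be aggregated to a coarser resolution before datasets are combined.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (min_lon, min_lat, max_lon, max_lat); an assumed convention for this sketch
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical index entry; the fields are illustrative, not Klimatic's schema."""
    url: str
    variable: str      # data type, e.g. "temperature"
    provider: str
    bbox: BBox
    start_year: int
    end_year: int


def boxes_overlap(a: BBox, b: BBox) -> bool:
    """True when two bounding boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query(index: List[DatasetRecord], variable: str, bbox: BBox,
          years: Tuple[int, int], provider: Optional[str] = None) -> List[DatasetRecord]:
    """Return records matching the type, spatial, temporal, and provider filters."""
    first, last = years
    return [
        r for r in index
        if r.variable == variable
        and boxes_overlap(r.bbox, bbox)
        and r.start_year <= last and first <= r.end_year
        and (provider is None or r.provider == provider)
    ]


def block_mean(grid: List[List[float]], factor: int) -> List[List[float]]:
    """Aggregate a grid to a coarser resolution by averaging factor x factor blocks."""
    rows, cols = len(grid) // factor, len(grid[0]) // factor
    return [
        [
            sum(grid[i * factor + di][j * factor + dj]
                for di in range(factor) for dj in range(factor)) / factor ** 2
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Hypothetical index contents, for illustration only
index = [
    DatasetRecord("https://example.org/a.nc", "temperature", "p1", (-100, 30, -80, 50), 1990, 2000),
    DatasetRecord("https://example.org/b.nc", "temperature", "p2", (0, 0, 10, 10), 1990, 2000),
    DatasetRecord("https://example.org/c.nc", "precipitation", "p1", (-100, 30, -80, 50), 1990, 2000),
]
hits = query(index, "temperature", (-90, 40, -85, 45), (1995, 1996))
coarse = block_mean([[1, 2, 3, 4], [5, 6, 7, 8]], 2)
```

Block averaging is only one possible aggregation choice. The abstract also mentions interpolation, which would be needed when the grids being combined do not align by an integer factor.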

Original language: English
Title of host publication: Proceedings of PDSW-DISCS 2016
Subtitle of host publication: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 31-36
Number of pages: 6
ISBN (Electronic): 9781509052165
State: Published - Jan 30 2017
Externally published: Yes
Event: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2016 - Salt Lake City, United States
Duration: Nov 14 2016 → …

Publication series

Name: Proceedings of PDSW-DISCS 2016: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2016
Country/Territory: United States
City: Salt Lake City
Period: 11/14/16 → …

Funding

This work was supported in part by DOE contract DE-AC02-06CH11357 and by NSF Decision Making Under Uncertainty program award 0951576.
