Skluma: A statistical learning pipeline for taming unkempt data repositories

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

Scientists' capacity to make use of existing data is predicated on their ability to find and understand those data. While significant progress has been made with respect to data publication, and indeed one can point to a number of well-organized and highly utilized data repositories, there remain many such repositories in which archived data are poorly described and thus impossible to use. We present Skluma, an automated system designed to process vast amounts of data and extract deeply embedded metadata, latent topics, relationships between data, and contextual metadata derived from related documents. We show that Skluma can be used to organize and index a large climate data collection that totals more than 500 GB of data in over a half-million files.

Original language: English
Title of host publication: SSDBM 2017
Subtitle of host publication: 29th International Conference on Scientific and Statistical Database Management
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450352826
State: Published - Jun 27 2017
Externally published: Yes
Event: 29th International Conference on Scientific and Statistical Database Management, SSDBM 2017 - Chicago, United States
Duration: Jun 27 2017 - Jun 29 2017

Publication series

Name: ACM International Conference Proceeding Series
Volume: Part F128636

Conference

Conference: 29th International Conference on Scientific and Statistical Database Management, SSDBM 2017
Country/Territory: United States
City: Chicago
Period: 06/27/17 - 06/29/17

Keywords

  • Data integration
  • Data wrangling
  • Metadata extraction
  • Statistical learning
