Skluma: An extensible metadata extraction pipeline for disorganized data

  • Tyler J. Skluzacek
  • Rohan Kumar
  • Ryan Chard
  • Galen Harrison
  • Paul Beckman
  • Kyle Chard
  • Ian Foster

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

15 Scopus citations

Abstract

To mitigate the effects of high-velocity data expansion and to automate the organization of filesystems and data repositories, we have developed Skluma, a system that automatically processes a target filesystem or repository, extracts content- and context-based metadata, and organizes extracted metadata for subsequent use. Skluma is able to extract diverse metadata, including aggregate values derived from embedded structured data; named entities and latent topics buried within free-text documents; and content encoded in images. Skluma implements an overarching probabilistic pipeline to extract increasingly specific metadata from files. It applies machine learning methods to determine file types, dynamically prioritizes and then executes a suite of metadata extractors, and explores contextual metadata based on relationships among files. The derived metadata, represented in JSON, describes probabilistic knowledge of each file that may be subsequently used for discovery or organization. Skluma's architecture enables it to be deployed locally or used as an on-demand, cloud-hosted service to create and execute dynamic extraction workflows on massive numbers of files. It is modular and extensible, allowing users to contribute their own specialized metadata extractors. Thus far we have tested Skluma on local filesystems, remote FTP-accessible servers, and publicly accessible Globus endpoints. We have demonstrated its efficacy by applying it to a scientific environmental data repository of more than 500,000 files. We show that we can extract metadata from those files with modest cloud costs in a few hours.
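The pipeline described in the abstract (estimate file types, prioritize extractors, emit JSON) can be illustrated with a minimal sketch. This is not Skluma's code or API: every name here (`guess_types`, `extract_tabular`, `extract_freetext`, `process`, the `threshold` parameter) is hypothetical, and the comma-counting heuristic is only a toy stand-in for the machine learning file-type model the abstract mentions.

```python
import csv
import io
import json
from collections import Counter


def guess_types(text):
    """Toy stand-in for a learned file-type model: returns {type: probability}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {"empty": 1.0}
    comma_counts = [ln.count(",") for ln in lines]
    # Rows sharing the same nonzero comma count suggest tabular data.
    consistent = comma_counts[0] > 0 and len(set(comma_counts)) == 1
    p_tabular = 0.9 if consistent else 0.1
    return {"tabular": p_tabular, "freetext": round(1.0 - p_tabular, 2)}


def extract_tabular(text):
    """Aggregate values (min, max, mean) for each fully numeric column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    stats = {}
    for col in (rows[0].keys() if rows else []):
        try:
            vals = [float(r[col]) for r in rows]
        except (KeyError, TypeError, ValueError):
            continue  # skip columns that are not fully numeric
        stats[col] = {"min": min(vals), "max": max(vals),
                      "mean": sum(vals) / len(vals)}
    return {"columns": stats}


def extract_freetext(text):
    """Most frequent words, a crude stand-in for topic or entity extraction."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    common = Counter(w for w in words if len(w) > 3).most_common(3)
    return {"top_words": [w for w, _ in common]}


# Hypothetical extractor registry; users could add their own entries.
EXTRACTORS = {"tabular": extract_tabular, "freetext": extract_freetext}


def process(name, text, threshold=0.5):
    """Run extractors in descending order of type probability; return JSON."""
    probs = guess_types(text)
    record = {"file": name, "type_probabilities": probs, "metadata": {}}
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    for ftype, p in ranked:
        if p < threshold or ftype not in EXTRACTORS:
            continue  # too unlikely, or no extractor registered for this type
        record["metadata"][ftype] = EXTRACTORS[ftype](text)
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(process("obs.csv", "site,temp\nA,10\nB,20\n"))
```

Keeping the type probabilities in the output record, rather than a single hard label, mirrors the abstract's point that the derived metadata describes probabilistic knowledge of each file.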

Original language: English
Title of host publication: Proceedings - IEEE 14th International Conference on eScience, e-Science 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 256-266
Number of pages: 11
ISBN (Electronic): 9781538691564
DOIs
State: Published - Dec 24 2018
Externally published: Yes
Event: 14th IEEE International Conference on eScience, e-Science 2018 - Amsterdam, Netherlands
Duration: Oct 29 2018 to Nov 1 2018

Publication series

Name: Proceedings - IEEE 14th International Conference on eScience, e-Science 2018

Conference

Conference: 14th IEEE International Conference on eScience, e-Science 2018
Country/Territory: Netherlands
City: Amsterdam
Period: 10/29/18 to 11/1/18

Funding

This work was supported in part by the UChicago CERES Center for Unstoppable Computing, DOE contract DE-AC02-06CH11357, and NIH contract 1U54EB020406. We also acknowledge research credits provided by Amazon Web Services and compute hours from Jetstream [38]. We thank Emily Herron, Dr. Shan Lu, Goutham Rajendran, and Chaofeng Wu for their input on extractors.

Keywords

  • Data Swamp
  • Metadata extraction
  • Scientific repository
