Serverless workflows for indexing large scientific data

Tyler J. Skluzacek, Ryan Chard, Ryan Wong, Zhuozhao Li, Yadu N. Babuji, Logan Ward, Ben Blaiszik, Kyle Chard, Ian Foster

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

14 Scopus citations

Abstract

The use and reuse of scientific data is ultimately dependent on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. Unfortunately, due to growing data volumes, large and distributed collaborations, and a desire to store data for long periods of time, scientific “data lakes” quickly become disorganized and lack the metadata necessary to be useful to researchers. New automated approaches are needed to derive metadata from scientific files and to use these metadata for organization and discovery. Here we describe one such system, Xtract, a service capable of processing vast collections of scientific files and automatically extracting metadata from diverse file types. Xtract relies on function-as-a-service (FaaS) models to enable scalable metadata extraction by orchestrating the execution of many short-running extractor functions. To reduce data transfer costs, Xtract can be configured to deploy extractors centrally or near the data (i.e., at the edge). We present a prototype implementation of Xtract and demonstrate that it can derive metadata from a 7 TB scientific data repository.
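The orchestration pattern described in the abstract (many short-running extractor functions, each chosen by file type) can be illustrated with a minimal Python sketch. This is not Xtract's implementation or API: the extractor names, the `EXTRACTORS` mapping, and `index_directory` are hypothetical, and a local thread pool stands in for a function-as-a-service backend.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_text(path: Path) -> dict:
    """Hypothetical extractor: count lines and words in a text file."""
    content = path.read_text(encoding="utf-8", errors="replace")
    return {"lines": len(content.splitlines()), "words": len(content.split())}


def extract_tabular(path: Path) -> dict:
    """Hypothetical extractor: read column names from a CSV header row."""
    with path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return {"columns": header}


def extract_generic(path: Path) -> dict:
    """Fallback for unrecognized file types: no type-specific metadata."""
    return {}


# Assumed mapping from file suffix to extractor function (illustrative only).
EXTRACTORS = {".txt": extract_text, ".csv": extract_tabular}


def extract_one(path: Path) -> dict:
    """Run the extractor matching the file's suffix and return its metadata."""
    extractor = EXTRACTORS.get(path.suffix.lower(), extract_generic)
    metadata = {
        "path": str(path),
        "size_bytes": path.stat().st_size,
        "extractor": extractor.__name__,
    }
    metadata.update(extractor(path))
    return metadata


def index_directory(root: str, max_workers: int = 4) -> list:
    """Apply extractors to every file under root, one short task per file."""
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    # The thread pool is a local stand-in for dispatching FaaS invocations.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_one, files))
```

Calling `index_directory("/path/to/data")` returns one metadata dictionary per file. In a deployment like the one the abstract describes, the dispatch step would presumably target remote function endpoints placed centrally or near the data rather than local threads.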

Original language: English
Title of host publication: WOSC 2019 - Proceedings of the 2019 5th International Workshop on Serverless Computing, Part of Middleware 2019
Publisher: Association for Computing Machinery, Inc
Pages: 43-48
Number of pages: 6
ISBN (Electronic): 9781450370387
DOIs
State: Published - Dec 9 2019
Externally published: Yes
Event: 5th International Workshop on Serverless Computing, WOSC 2019 - Part of Middleware 2019 - Davis, United States
Duration: Dec 9 2019 to Dec 13 2019

Publication series

Name: WOSC 2019 - Proceedings of the 2019 5th International Workshop on Serverless Computing, Part of Middleware 2019

Conference

Conference: 5th International Workshop on Serverless Computing, WOSC 2019 - Part of Middleware 2019
Country/Territory: United States
City: Davis
Period: 12/9/19 to 12/13/19

Funding

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory as well as the Jetstream cloud for science and engineering [19].

Keywords

  • Data lakes
  • File systems
  • Materials science
  • Metadata extraction
  • Serverless
