A Serverless Framework for Distributed Bulk Metadata Extraction

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

14 Scopus citations

Abstract

We introduce Xtract, an automated and scalable system for bulk metadata extraction from large, distributed research data repositories. Xtract orchestrates the application of metadata extractors to groups of files, determining which extractors to apply to each file and, for each extractor and file, where to execute. A hybrid computing model, built on the funcX federated FaaS platform, enables Xtract to balance tradeoffs between extraction time and data transfer costs by dispatching each extraction task to the most appropriate location. Experiments on a range of clouds and supercomputers show that Xtract can efficiently process multi-million-file repositories by orchestrating the concurrent execution of container-based extractors on thousands of nodes. We highlight the flexibility of Xtract by applying it to a large, semi-curated scientific data repository and to an uncurated scientific Google Drive repository. We show that by remotely orchestrating metadata extraction across decentralized storage and compute nodes, Xtract can process large repositories in 50% of the time it takes just to transfer the same data to a machine within the same computing facility. We also show that when transferring data is necessary (e.g., no local compute is available), Xtract can scale to process files as fast as they are received, even over a multi-GB/s network.
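The placement trade-off described in the abstract (run an extractor where the file lives, or move the file to faster compute) can be illustrated with a toy cost model. This is a minimal sketch under stated assumptions, not Xtract's actual scheduler and not the funcX API: the `Site` and `Task` classes, the `speedup` factor, the single shared `bandwidth_bytes_per_s` value, and the `choose_site` function are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Site:
    """Hypothetical description of a location that may run extractors."""
    name: str
    has_compute: bool
    speedup: float = 1.0  # assumed relative extractor speed (1.0 = baseline)


@dataclass(frozen=True)
class Task:
    """Hypothetical extraction task: one extractor applied to one file."""
    file_size_bytes: int
    home_site: str              # site where the file is currently stored
    est_extract_seconds: float  # assumed baseline extraction time estimate


def estimated_seconds(task: Task, site: Site, bandwidth_bytes_per_s: float) -> float:
    """Estimated completion time: transfer (if the file must move) plus extraction."""
    transfer = 0.0 if site.name == task.home_site else task.file_size_bytes / bandwidth_bytes_per_s
    return transfer + task.est_extract_seconds / site.speedup


def choose_site(task: Task, sites: Iterable[Site], bandwidth_bytes_per_s: float) -> Site:
    """Pick the compute-capable site with the lowest estimated completion time."""
    candidates = [s for s in sites if s.has_compute]
    if not candidates:
        raise ValueError("no site with available compute")
    return min(candidates, key=lambda s: estimated_seconds(task, s, bandwidth_bytes_per_s))


if __name__ == "__main__":
    sites = [Site("storage", has_compute=True), Site("hpc", has_compute=True, speedup=4.0)]
    task = Task(file_size_bytes=10**9, home_site="storage", est_extract_seconds=2.0)
    print(choose_site(task, sites, bandwidth_bytes_per_s=1e9).name)
```

In this sketch, a cheap extractor stays next to the data because the transfer would dominate, while an expensive extractor can justify moving the file to a faster site. When the storage site has no compute at all, transfer is unavoidable, which corresponds to the "no local compute is available" case mentioned in the abstract.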

Original language: English
Title of host publication: HPDC 2021 - Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery, Inc
Pages: 7-18
Number of pages: 12
ISBN (Electronic): 9781450382175
State: Published - Jun 21 2021
Externally published: Yes
Event: 30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2021 - Virtual, Online, Sweden
Duration: Jun 21 2021 to Jun 25 2021

Publication series

Name: HPDC 2021 - Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing

Conference

Conference: 30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2021
Country/Territory: Sweden
City: Virtual, Online
Period: 06/21/21 to 06/25/21

Funding

We gratefully acknowledge the computing resources provided and operated by the Research Computing Center at the University of Chicago, and by the Joint Laboratory for System Evaluation (JLSE) and the Argonne Leadership Computing Facility (ALCF) at Argonne National Laboratory. This work was performed under financial award 70NANB19H005 from the U.S. Dept. of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD). This work is supported in part by the National Science Foundation under Grant No. 2004894.

Keywords

  • files
  • metadata extraction
  • search index
  • serverless
  • storage
