Abstract
We introduce Xtract, an automated and scalable system for bulk metadata extraction from large, distributed research data repositories. Xtract orchestrates the application of metadata extractors to groups of files, determining which extractors to apply to each file and, for each extractor and file, where to execute. A hybrid computing model, built on the funcX federated FaaS platform, enables Xtract to balance tradeoffs between extraction time and data transfer costs by dispatching each extraction task to the most appropriate location. Experiments on a range of clouds and supercomputers show that Xtract can efficiently process multi-million-file repositories by orchestrating the concurrent execution of container-based extractors on thousands of nodes. We highlight the flexibility of Xtract by applying it to a large, semi-curated scientific data repository and to an uncurated scientific Google Drive repository. We show that by remotely orchestrating metadata extraction across decentralized storage and compute nodes, Xtract can process large repositories in 50% of the time it takes just to transfer the same data to a machine within the same computing facility. We also show that when transferring data is necessary (e.g., no local compute is available), Xtract can scale to process files as fast as they are received, even over a multi-GB/s network.
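The placement tradeoff described in the abstract (run an extractor where the data lives, or move the data to faster compute) can be illustrated with a minimal sketch. This is not Xtract's scheduler or the funcX API. The `Site` type, the `estimated_seconds` and `choose_site` helpers, and all bandwidth and throughput figures are hypothetical, and the cost model (transfer time plus extraction time) is an assumption made only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """A hypothetical compute endpoint (not an Xtract or funcX type)."""
    name: str
    bandwidth_bytes_per_s: float  # assumed link speed from the data's home site
    extract_bytes_per_s: float    # assumed extractor throughput; 0 means no compute


def estimated_seconds(file_bytes: int, site: Site, data_site: str) -> float:
    """Estimated transfer time (zero if the file is already local) plus extraction time."""
    if site.extract_bytes_per_s <= 0:
        return float("inf")  # no usable compute at this site
    if site.name == data_site:
        transfer = 0.0
    elif site.bandwidth_bytes_per_s <= 0:
        return float("inf")  # the file cannot be moved to this site
    else:
        transfer = file_bytes / site.bandwidth_bytes_per_s
    return transfer + file_bytes / site.extract_bytes_per_s


def choose_site(file_bytes: int, sites: list[Site], data_site: str) -> Site:
    """Pick the site with the lowest estimated total time."""
    return min(sites, key=lambda s: estimated_seconds(file_bytes, s, data_site))


# Illustrative numbers only: slow compute next to the data, fast compute elsewhere.
sites = [
    Site("storage", bandwidth_bytes_per_s=0.0, extract_bytes_per_s=1e8),
    Site("hpc", bandwidth_bytes_per_s=1e9, extract_bytes_per_s=1e9),
]
best = choose_site(1_000_000_000, sites, data_site="storage")
print(best.name)
```

Under these assumed numbers, moving a 1 GB file to the faster site costs less than extracting in place. If the storage site had equally fast compute, the same function would keep the task local and avoid the transfer.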
| Original language | English |
|---|---|
| Title of host publication | HPDC 2021 - Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 7-18 |
| Number of pages | 12 |
| ISBN (Electronic) | 9781450382175 |
| State | Published - Jun 21 2021 |
| Externally published | Yes |
| Event | 30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2021 - Virtual, Online, Sweden. Duration: Jun 21 2021 → Jun 25 2021 |
Publication series
| Name | HPDC 2021 - Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing |
|---|---|
Conference
| Conference | 30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2021 |
|---|---|
| Country/Territory | Sweden |
| City | Virtual, Online |
| Period | 06/21/21 → 06/25/21 |
Funding
We gratefully acknowledge the computing resources provided and operated by the Research Computing Center at the University of Chicago, and by the Joint Laboratory for System Evaluation (JLSE) and the Argonne Leadership Computing Facility (ALCF) at Argonne National Laboratory. This work was performed under financial award 70NANB19H005 from the U.S. Department of Commerce, National Institute of Standards and Technology, as part of the Center for Hierarchical Materials Design (CHiMaD). This work is supported in part by the National Science Foundation under Grant No. 2004894.
Keywords
- files
- metadata extraction
- search index
- serverless
- storage