Abstract
The use and reuse of scientific data is ultimately dependent on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. Unfortunately, due to growing data volumes, large and distributed collaborations, and a desire to store data for long periods of time, scientific “data lakes” quickly become disorganized and lack the metadata necessary to be useful to researchers. New automated approaches are needed to derive metadata from scientific files and to use these metadata for organization and discovery. Here we describe one such system, Xtract, a service capable of processing vast collections of scientific files and automatically extracting metadata from diverse file types. Xtract relies on function-as-a-service (FaaS) models to enable scalable metadata extraction by orchestrating the execution of many short-running extractor functions. To reduce data transfer costs, Xtract can be configured to deploy extractors centrally or near the data (i.e., at the edge). We present a prototype implementation of Xtract and demonstrate that it can derive metadata from a 7 TB scientific data repository.
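The orchestration idea in the abstract, many short-running extractor functions dispatched according to file type, can be illustrated with a minimal Python sketch. This is not Xtract's implementation or API: the extractor functions, the `EXTRACTORS` mapping, and `extract_all` are hypothetical names introduced for illustration, and a local thread pool stands in for a function-as-a-service platform.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_text(path: Path) -> dict:
    """Hypothetical extractor: basic statistics for a plain-text file."""
    text = path.read_text(errors="replace")
    return {"lines": text.count("\n"), "words": len(text.split())}


def extract_json(path: Path) -> dict:
    """Hypothetical extractor: top-level keys of a JSON document."""
    doc = json.loads(path.read_text())
    return {"keys": sorted(doc) if isinstance(doc, dict) else []}


def extract_generic(path: Path) -> dict:
    """Fallback extractor: file-system metadata only."""
    return {"size_bytes": path.stat().st_size}


# Hypothetical mapping from file suffix to extractor function.
EXTRACTORS = {".txt": extract_text, ".json": extract_json}


def run_extractor(path: Path) -> dict:
    """Pick an extractor by suffix and run it as one short, independent task."""
    extractor = EXTRACTORS.get(path.suffix.lower(), extract_generic)
    try:
        metadata = extractor(path)
    except Exception as exc:  # a single bad file should not abort the whole run
        metadata = {"error": str(exc)}
    return {"path": str(path), "extractor": extractor.__name__, "metadata": metadata}


def extract_all(root: Path, max_workers: int = 8) -> list[dict]:
    """Crawl `root` and run one extractor per file concurrently.

    Assumption: the thread pool is a local stand-in for a FaaS backend.
    """
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_extractor, files))
```

Calling `extract_all(Path("some/data/dir"))` returns one metadata record per file. In this sketch, running the same function on the machine that hosts the data, rather than copying the files elsewhere first, corresponds loosely to the edge deployment option the abstract mentions.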
| Original language | English |
|---|---|
| Title of host publication | WOSC 2019 - Proceedings of the 2019 5th International Workshop on Serverless Computing, Part of Middleware 2019 |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 43-48 |
| Number of pages | 6 |
| ISBN (Electronic) | 9781450370387 |
| State | Published - Dec 9 2019 |
| Externally published | Yes |
| Event | 5th International Workshop on Serverless Computing, WOSC 2019 - Part of Middleware 2019 - Davis, United States. Duration: Dec 9 2019 → Dec 13 2019 |
Publication series
| Name | WOSC 2019 - Proceedings of the 2019 5th International Workshop on Serverless Computing, Part of Middleware 2019 |
|---|---|
Conference
| Conference | 5th International Workshop on Serverless Computing, WOSC 2019 - Part of Middleware 2019 |
|---|---|
| Country/Territory | United States |
| City | Davis |
| Period | 12/9/19 → 12/13/19 |
Funding
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory as well as the Jetstream cloud for science and engineering [19].
Keywords
- Data lakes
- File systems
- Materials science
- Metadata extraction
- Serverless