Models and Metrics for Mining Meaningful Metadata

Tyler J. Skluzacek, Matthew Chen, Erica Hsu, Kyle Chard, Ian Foster

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

The increasing volume and variety of science data have led to the creation of metadata extraction systems that automatically derive and synthesize relevant information from files. A critical component of metadata extraction systems is a mechanism for mapping extractors (lightweight tools that mine information from particular file types) to each file in a repository. However, existing methods do little to address the heterogeneity and scale of science data, thereby leaving valuable data unextracted or wasting significant compute resources applying incorrect extractors to data. We construct an extractor scheduler that leverages file type identification (FTI) methods. We show that by training lightweight multi-label, multi-class statistical models on byte samples from files, we can correctly map 35% more extractors to files than by using libmagic. Further, we introduce a metadata quality toolkit to automatically assess the utility of extracted metadata.

Original language: English
Title of host publication: Computational Science - ICCS 2022, 22nd International Conference, Proceedings
Editors: Derek Groen, Clélia de Mulatier, Valeria V. Krzhizhanovskaya, Peter M.A. Sloot, Maciej Paszynski, Jack J. Dongarra
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 417-430
Number of pages: 14
ISBN (Print): 9783031087509
State: Published - 2022
Externally published: Yes
Event: 22nd Annual International Conference on Computational Science, ICCS 2022 - London, United Kingdom
Duration: Jun 21 2022 to Jun 23 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13350 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 22nd Annual International Conference on Computational Science, ICCS 2022
Country/Territory: United Kingdom
City: London
Period: 06/21/22 to 06/23/22

Funding

Acknowledgements. We gratefully acknowledge Takuya Kurihana (University of Chicago) for sharing his machine learning expertise. This work is supported in part by the National Science Foundation under Grants No. 2004894 and 1757970, and used resources of the Argonne Leadership Computing Facility.

Keywords

  • Extraction
  • File type identification
  • Metadata quality
