An Integrated Indexing and Search Service for Distributed File Systems

Hyogi Sim, Awais Khan, Sudharshan S. Vazhkudai, Seung Hwan Lim, Ali R. Butt, Youngjae Kim

Research output: Contribution to journal > Article > peer-review

15 Scopus citations

Abstract

Data services such as search, discovery, and management in scalable distributed environments have traditionally been decoupled from the underlying file systems and are often deployed using external databases and indexing services. However, modern data production rates, looming data movement costs, and the lack of metadata call for revisiting this decoupled design of file systems and data services. In this article, we present TagIt, a scalable data management service framework aimed at scientific datasets, which can be integrated into prevalent distributed file system architectures. A key feature of TagIt is a scalable, distributed metadata indexing framework, which provides a flexible tagging capability to support data discovery. Furthermore, tags can be associated with an active operator for pre-processing, filtering, or automatic metadata extraction, which we seamlessly offload to file servers in a load-aware fashion. We have integrated TagIt into two popular distributed file systems, GlusterFS and CephFS. Our evaluation demonstrates that TagIt can expedite data search operations by up to 10× over the extant decoupled approach.
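
As a rough illustration of the two ideas in the abstract (a tag index sharded across file servers, and operators dispatched to the least-loaded server), the following minimal Python sketch may help. It is not TagIt's implementation or API: the names `FileServer`, `Cluster`, `tag`, `search`, and `offload` are hypothetical, the placement rule is an arbitrary stand-in, and "load" is approximated by a simple counter of operator runs.

```python
from collections import defaultdict


class FileServer:
    """Hypothetical file server that keeps a local shard of the tag index."""

    def __init__(self, name):
        self.name = name
        self.index = defaultdict(set)  # (key, value) -> set of file paths
        self.load = 0                  # assumed load metric: operator runs so far

    def tag(self, path, key, value):
        self.index[(key, value)].add(path)

    def search(self, key, value):
        # .get avoids creating empty entries in the defaultdict on lookups.
        return set(self.index.get((key, value), set()))

    def run_operator(self, operator, path):
        self.load += 1
        return operator(path)


class Cluster:
    """Hypothetical front end that routes tags, searches, and operators."""

    def __init__(self, servers):
        self.servers = servers

    def _home(self, path):
        # Deterministic placement by byte sum; an assumption for illustration.
        return self.servers[sum(path.encode()) % len(self.servers)]

    def tag(self, path, key, value):
        # The tag is indexed on the server that holds the file.
        self._home(path).tag(path, key, value)

    def search(self, key, value):
        # Query every shard and merge the results.
        hits = set()
        for server in self.servers:
            hits |= server.search(key, value)
        return sorted(hits)

    def offload(self, operator, path):
        # Load-aware dispatch: pick the server with the smallest load counter.
        target = min(self.servers, key=lambda s: s.load)
        return target.name, target.run_operator(operator, path)
```

In this sketch a search never leaves the file servers, which is the contrast with an external database that the abstract draws; a real system would also need persistence, index consistency on file updates, and a more realistic load metric.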

Original language: English
Article number: 9079563
Pages (from-to): 2375-2391
Number of pages: 17
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 31
Issue number: 10
State: Published - Oct 1 2020

Funding

This research was supported in part by the U.S. DOE's Scientific Data Management program, by the U.S. National Science Foundation through Grants CNS-1615411, CNS-1405697, and CNS-1565314, and by a National Research Foundation of Korea (NRF) grant funded by the Korea Government (Ministry of Science and ICT) under Grant 2018R1A1A1A05079398. The work was also supported by, and used the resources of, the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at ORNL, which is managed by UT-Battelle, LLC for the U.S. DOE under Contract No. DE-AC05-00OR22725.

Funders                                  Funder number
Ministry of Science and ICT              2018R1A1A1A05079398
U.S. DOE
National Science Foundation              CNS-1615411, CNS-1565314, CNS-1405697
U.S. Department of Energy
National Research Foundation of Korea

Keywords

• Distributed systems
• Scientific data management
• Storage management
