Abstract
Data services such as search, discovery, and management in scalable distributed environments have traditionally been decoupled from the underlying file systems, and are often deployed using external databases and indexing services. However, modern data production rates, looming data movement costs, and the lack of metadata entail revisiting the decoupled file system-data services design philosophy. In this article, we present TagIt, a scalable data management service framework aimed at scientific datasets, which can be integrated into prevalent distributed file system architectures. A key feature of TagIt is a scalable, distributed metadata indexing framework, which facilitates a flexible tagging capability to support data discovery. Furthermore, the tags can also be associated with an active operator for pre-processing, filtering, or automatic metadata extraction, which we seamlessly offload to file servers in a load-aware fashion. We have integrated TagIt into two popular distributed file systems, GlusterFS and CephFS. Our evaluation demonstrates that TagIt can expedite data search operations by up to 10× over the extant decoupled approach.
Original language | English |
---|---|
Article number | 9079563 |
Pages (from-to) | 2375-2391 |
Number of pages | 17 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 31 |
Issue number | 10 |
State | Published - Oct 1 2020 |
Funding
This research was supported in part by the U.S. DOE’s Scientific data management program, by the U.S. National Science Foundation through Grants CNS-1615411, CNS-1405697, and CNS-1565314, and by a National Research Foundation of Korea (NRF) grant funded by the Korea Government (Ministry of Science and ICT) under Grant 2018R1A1A1A05079398. The work was also supported by, and used the resources of, the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at ORNL, which is managed by UT-Battelle, LLC for the U.S. DOE under Contract No. DE-AC05-00OR22725.
Funders | Funder number |
---|---|
Ministry of Science and ICT | 2018R1A1A1A05079398 |
U.S. DOE | |
National Science Foundation | CNS-1615411, CNS-1565314, CNS-1405697 |
U.S. Department of Energy | |
National Research Foundation of Korea | |
Keywords
- Distributed systems
- scientific data management
- storage management