A vision for managing extreme-scale data hoards

Jeremy Logan, Kshitij Mehta, Gerd Heber, Scott Klasky, Tahsin Kurc, Norbert Podhorszki, Patrick Widener, Matthew Wolf

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

4 Scopus citations

Abstract

Scientific data collections grow ever larger, both in terms of the size of individual data items and of the number and complexity of items. To use and manage them, it is important to directly address issues of robust and actionable provenance. We identify three key drivers as our focus: managing the size and complexity of metadata, lack of a priori information to match usage intents between publishers and consumers of data, and support for campaigns over collections of data driven by multi-disciplinary, collaborating teams. We introduce the Hoarde abstraction as an attempt to formalize a way of looking at collections of data to make them more tractable for later use. Hoarde leverages middleware and systems infrastructures for scientific and technical data management. Through the lens of a select group of challenging data usage scenarios, we discuss some of the aspects of implementation, usage, and forward portability of this new view on data management.

Original language: English
Title of host publication: Proceedings - 2019 39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1806-1817
Number of pages: 12
ISBN (Electronic): 9781728125190
State: Published - Jul 2019
Event: 39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019 - Richardson, United States
Duration: Jul 7 2019 to Jul 9 2019

Publication series

Name: Proceedings - International Conference on Distributed Computing Systems
Volume: 2019-July

Conference

Conference: 39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019
Country/Territory: United States
City: Richardson
Period: 07/7/19 to 07/9/19

Funding

Without the continued support from the Department of Energy’s Office of Advanced Scientific Computing Research, the projects upon which this future vision rests, including SIRIUS, MONA, and SENSEI, would not be possible. Support from the DOE computing facilities at Oak Ridge and NERSC, as well as from the National Science Foundation, was also critical. This work was also supported in part by grants 1U24CA180924-01A1, 3U24CA215109-02, and 1UG3CA225021-01 from the National Cancer Institute, and by grants R01LM011119-01 and R01LM009239 from the U.S. National Library of Medicine.

Funders (funder number)
SENSEI
National Science Foundation
U.S. Department of Energy
National Cancer Institute (R01LM009239, R01LM011119-01)
U.S. National Library of Medicine
Advanced Scientific Computing Research
Oak Ridge Associated Universities

    Keywords

    • Data provenance
    • Metadata management
    • Reproducibility
    • Scientific data management
