TY - GEN
T1 - A vision for managing extreme-scale data hoards
AU - Logan, Jeremy
AU - Mehta, Kshitij
AU - Heber, Gerd
AU - Klasky, Scott
AU - Kurc, Tahsin
AU - Podhorszki, Norbert
AU - Widener, Patrick
AU - Wolf, Matthew
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/7
Y1 - 2019/7
N2 - Scientific data collections grow ever larger, both in terms of the size of individual data items and of the number and complexity of items. To use and manage them, it is important to directly address issues of robust and actionable provenance. We identify three key drivers as our focus: managing the size and complexity of metadata, lack of a priori information to match usage intents between publishers and consumers of data, and support for campaigns over collections of data driven by multi-disciplinary, collaborating teams. We introduce the Hoarde abstraction as an attempt to formalize a way of looking at collections of data to make them more tractable for later use. Hoarde leverages middleware and systems infrastructures for scientific and technical data management. Through the lens of a select group of challenging data usage scenarios, we discuss some of the aspects of implementation, usage, and forward portability of this new view on data management.
KW - Data provenance
KW - Metadata management
KW - Reproducibility
KW - Scientific data management
UR - http://www.scopus.com/inward/record.url?scp=85074827212&partnerID=8YFLogxK
U2 - 10.1109/ICDCS.2019.00179
DO - 10.1109/ICDCS.2019.00179
M3 - Conference contribution
AN - SCOPUS:85074827212
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 1806
EP - 1817
BT - Proceedings - 2019 39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019
Y2 - 7 July 2019 through 9 July 2019
ER -