A systemic approach to facilitating reproducibility via federated, end-to-end data management

Dale Stansberry, Suhas Somnath, Gregory Shutt, Mallikarjun Shankar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Advances in computing infrastructure and instrumentation have accelerated scientific discovery while also causing data volumes to explode. Unfortunately, the lack of equally advanced data management infrastructure has led to ad hoc practices that diminish scientific productivity and exacerbate the reproducibility crisis. We discuss a systemwide solution that supports management needs at every stage of the data lifecycle. At the center of this system is DataFed, a general-purpose scientific data management system that addresses these challenges by federating data storage across facilities with central metadata and provenance management, providing simple and uniform data discovery, access, and collaboration capabilities. At the edge is a Data Gateway that captures raw data and context from experiments (even when performed on off-network instruments) into DataFed. DataFed can be integrated into analytics platforms so that they work with datasets easily, correctly, and reliably, improving the reproducibility of such workloads. We believe that this system can significantly alleviate the burden of data management and improve compliance with the Findable, Accessible, Interoperable, Reusable (FAIR) data principles, thereby improving scientific productivity and rigor.

Original language: English
Title of host publication: Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI - 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020, Revised Selected Papers
Editors: Jeffrey Nichols, Arthur ‘Barney’ Maccabe, Suzanne Parete-Koon, Becky Verastegui, Oscar Hernandez, Theresa Ahearn
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 83-98
Number of pages: 16
ISBN (Print): 9783030633929
State: Published - 2021
Event: 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020 - Virtual, Online
Duration: Aug 26 2020 to Aug 28 2020

Publication series

Name: Communications in Computer and Information Science
Volume: 1315 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020
City: Virtual, Online
Period: 08/26/20 to 08/28/20

Funding

Acknowledgments. This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) and of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. D. Stansberry et al.: Contributed Equally. This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
