Provenance-aware workflow for data quality management and improvement for large continuous scientific data streams

Jitendra Kumar, Michael C. Crow, Ranjeet Devarakonda, Michael Giansiracusa, Kavya Guntupally, Joseph V. Olatt, Zach Price, Harold A. Shanafield, Alka Singh

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

3 Scopus citations

Abstract

Data quality assessment, management and improvement is an integral part of any big data intensive scientific research to ensure accurate, reliable, and reproducible scientific discoveries. The task of maintaining the quality of data, however, is non-trivial and poses a challenge for a program like the Department of Energy's Atmospheric Radiation Measurement (ARM) that collects data from hundreds of instruments across the world, and distributes thousands of streaming data products that are continuously produced in near-real-time for an archive 1.7 Petabyte in size and growing. In this paper, we present a computational data processing workflow to address the data quality issues via an easy and intuitive web-based portal that allows reporting of any quality issues for any site, facility or instruments at a granularity down to individual variables in the data files. This portal allows instrument specialists and scientists to provide corrective actions in the form of symbolic equations. A parallel processing framework applies the data improvement to a large volume of data in an efficient, parallel environment, while optimizing data transfer and file I/O operations; corrected files are then systematically versioned and archived. A provenance tracking module tracks and records any change made to the data during its entire life cycle which are communicated transparently to the scientific users. Developed in Python using open source technologies, this software architecture enables fast and efficient management and improvement of data in an operational data center environment.

Original languageEnglish
Title of host publicationProceedings - 2019 IEEE International Conference on Big Data, Big Data 2019
EditorsChaitanya Baru, Jun Huan, Latifur Khan, Xiaohua Tony Hu, Ronay Ak, Yuanyuan Tian, Roger Barga, Carlo Zaniolo, Kisung Lee, Yanfang Fanny Ye
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages3260-3266
Number of pages7
ISBN (Electronic)9781728108582
DOIs
StatePublished - Dec 2019
Event2019 IEEE International Conference on Big Data, Big Data 2019 - Los Angeles, United States
Duration: Dec 9 2019Dec 12 2019

Publication series

NameProceedings - 2019 IEEE International Conference on Big Data, Big Data 2019

Conference

Conference2019 IEEE International Conference on Big Data, Big Data 2019
Country/TerritoryUnited States
CityLos Angeles
Period12/9/1912/12/19

Funding

This research was supported by the Atmospheric Radiation Measurement (ARM) user facility, a U.S. Department of Energy (DOE) Office of Science user facility managed by the Office of Biological and Environmental Research. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

FundersFunder number
DOE Office of ScienceDE-AC05-00OR22725
Office of Biological and Environmental Research
U.S. Department of Energy
Office of Science

    Keywords

    • atmospheric science
    • data quality
    • provenance
    • scientific data workflows

    Fingerprint

    Dive into the research topics of 'Provenance-aware workflow for data quality management and improvement for large continuous scientific data streams'. Together they form a unique fingerprint.

    Cite this