TY - GEN
T1 - Provenance-aware workflow for data quality management and improvement for large continuous scientific data streams
AU - Kumar, Jitendra
AU - Crow, Michael C.
AU - Devarakonda, Ranjeet
AU - Giansiracusa, Michael
AU - Guntupally, Kavya
AU - Olatt, Joseph V.
AU - Price, Zach
AU - Shanafield, Harold A.
AU - Singh, Alka
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/12
Y1 - 2019/12
N2 - Data quality assessment, management and improvement is an integral part of any big data intensive scientific research to ensure accurate, reliable, and reproducible scientific discoveries. The task of maintaining the quality of data, however, is non-trivial and poses a challenge for a program like the Department of Energy's Atmospheric Radiation Measurement (ARM) that collects data from hundreds of instruments across the world, and distributes thousands of streaming data products that are continuously produced in near-real-time for an archive 1.7 Petabyte in size and growing. In this paper, we present a computational data processing workflow to address the data quality issues via an easy and intuitive web-based portal that allows reporting of any quality issues for any site, facility or instruments at a granularity down to individual variables in the data files. This portal allows instrument specialists and scientists to provide corrective actions in the form of symbolic equations. A parallel processing framework applies the data improvement to a large volume of data in an efficient, parallel environment, while optimizing data transfer and file I/O operations; corrected files are then systematically versioned and archived. A provenance tracking module tracks and records any change made to the data during its entire life cycle which are communicated transparently to the scientific users. Developed in Python using open source technologies, this software architecture enables fast and efficient management and improvement of data in an operational data center environment.
AB - Data quality assessment, management and improvement is an integral part of any big data intensive scientific research to ensure accurate, reliable, and reproducible scientific discoveries. The task of maintaining the quality of data, however, is non-trivial and poses a challenge for a program like the Department of Energy's Atmospheric Radiation Measurement (ARM) that collects data from hundreds of instruments across the world, and distributes thousands of streaming data products that are continuously produced in near-real-time for an archive 1.7 Petabyte in size and growing. In this paper, we present a computational data processing workflow to address the data quality issues via an easy and intuitive web-based portal that allows reporting of any quality issues for any site, facility or instruments at a granularity down to individual variables in the data files. This portal allows instrument specialists and scientists to provide corrective actions in the form of symbolic equations. A parallel processing framework applies the data improvement to a large volume of data in an efficient, parallel environment, while optimizing data transfer and file I/O operations; corrected files are then systematically versioned and archived. A provenance tracking module tracks and records any change made to the data during its entire life cycle which are communicated transparently to the scientific users. Developed in Python using open source technologies, this software architecture enables fast and efficient management and improvement of data in an operational data center environment.
KW - atmospheric science
KW - data quality
KW - provenance
KW - scientific data workflows
UR - http://www.scopus.com/inward/record.url?scp=85081367891&partnerID=8YFLogxK
U2 - 10.1109/BigData47090.2019.9006358
DO - 10.1109/BigData47090.2019.9006358
M3 - Conference contribution
AN - SCOPUS:85081367891
T3 - Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019
SP - 3260
EP - 3266
BT - Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019
A2 - Baru, Chaitanya
A2 - Huan, Jun
A2 - Khan, Latifur
A2 - Hu, Xiaohua Tony
A2 - Ak, Ronay
A2 - Tian, Yuanyuan
A2 - Barga, Roger
A2 - Zaniolo, Carlo
A2 - Lee, Kisung
A2 - Ye, Yanfang Fanny
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 IEEE International Conference on Big Data, Big Data 2019
Y2 - 9 December 2019 through 12 December 2019
ER -