Abstract
Reusing data is difficult even within well-defined science communities, and it becomes harder still when data from multiple communities and disciplines are combined. Through the lens of current work on constructing an environmental epidemiological data set from multiple disciplinary sources, we demonstrate the need for a new tool ecosystem to support heterogeneous big data science. Extending existing community standards for schemas and/or data formats through human auditing and wrangling of the data is not feasible at scale. This work therefore suggests new approaches for multi-disciplinary communities to build a shared tool ecosystem for big data. We discuss both the larger context of data wrangling of epidemiological data sets for novel artificial intelligence algorithms and the specific lessons from working with these multi-disciplinary data sets. Adopting a more model-driven, automatable approach not only promises better efficiency but also removes key sources of human-generated errors and promotes reuse and reproducibility of science data.
Original language | English |
---|---|
Title of host publication | Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021 |
Editors | Yixin Chen, Heiko Ludwig, Yicheng Tu, Usama Fayyad, Xingquan Zhu, Xiaohua Tony Hu, Suren Byna, Xiong Liu, Jianping Zhang, Shirui Pan, Vagelis Papalexakis, Jianwu Wang, Alfredo Cuzzocrea, Carlos Ordonez |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 3705-3708 |
Number of pages | 4 |
ISBN (Electronic) | 9781665439022 |
State | Published - 2021 |
Event | 2021 IEEE International Conference on Big Data, Big Data 2021 - Virtual, Online, United States; Duration: Dec 15 2021 → Dec 18 2021 |
Publication series
Name | Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021 |
---|---|
Conference
Conference | 2021 IEEE International Conference on Big Data, Big Data 2021 |
---|---|
Country/Territory | United States |
City | Virtual, Online |
Period | 12/15/21 → 12/18/21 |
Funding
This work was supported by the Biological Systems Science Division (BSSD) of the Office of Biological and Environmental Research (BER). This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- data wrangling
- domain-specific modeling
- spatial time-series data