TY - GEN
T1 - COVID-19 Data Curation Effort: An Initial Analysis of the Data
AU - Piburn, Jesse
AU - Stewart, Robert
AU - Kaufman, Jason
AU - Sorokine, Alexandre
AU - Axley, David
PY - 2020
Y1 - 2020
N2 - During the COVID-19 pandemic of 2020, major case reporting outlets quickly coalesced around two or three primary vendors. Johns Hopkins University and The New York Times were among the more prominent, and all were of great value to the nation, particularly during the uncertain early stages of the pandemic. They primarily focused on three major attributes: numbers of new cases, deaths, and recoveries. Recognizing that many states were reporting very detailed data sets (e.g., hospital beds) at a county level or finer, the ORNL Pandemic Modeling team embarked on a major data curation effort from March to June 2020 for the purpose of capturing this wealth of detailed data. The challenge of curating this data was daunting. The number of attributes reported by the states grew on an almost weekly basis. States were routinely shifting their web tool strategies away from easily parsable HTML-based formatting to new Tableau and ArcGIS content. This growth in the sheer number of attributes, combined with the unpredictable shifts in data format, meant an aggressive and agile combination of automated scripting and manual scraping was required to capture new daily streams. Further, the team had to scale up staff and widen its approach for capture and storage. As a result, the team collected more than 11 million data points. Following the close of this data collection effort on June 30th, 2020, the team embarked on a major effort to appraise what had been collected, including an inventory list, spatial completeness, temporal completeness, scale and geographic characteristics, and a determination. A report on this matter was submitted on September 15th, 2020, titled “DOE COVID-19 Data Curation Effort: Overview of Data Collection Coverage”. Over 2000 unique attributes had been netted over a wide range of spatial scales, including state, county, zip codes, health regions, and census blocks.
Over 11 million individual data points were collected across these attributes, and spatial coverage (in total) included all 50 states and multiple territories. What became apparent in the process is that in the absence of any data standards, many states reported a wide variety of unique attributes that were not always compatible with attributes reported in other states. As time went on, states began adding new attributes and offering finer-grained detail in some older attributes. This meant that not all data streams existed for the entire time period; in fact, the number of streams tended to increase dramatically towards the end. Often, states would begin an attribute series and then stop altogether. These highly variable and uncertain conditions illuminated the need for harmonization approaches that would reconcile and conflate changing attribute names and detail over time. One example is grouping racial data reported as either Black or African American, depending on the state, into a single harmonized attribute. These choices would make a within-state analysis possible during the time period and lead to potential between-state analytics later on. This was almost entirely a manual decision process, requiring some subjective decision-making at times, to prevent a fragmented, short-lived collection of time series fragments that would offer few insights into trends, patterns, and correlates. This report imports harmonized data for state and county into the World Spatio-Temporal Analytics and Mapping Project (WSTAMP). WSTAMP is a major space-time analysis and visualization tool developed at ORNL for the National Geospatial-Intelligence Agency specifically for this kind of exploratory analysis. WSTAMP offers a rich analytical and graphical environment consisting of a wide range of analytics. These include time series plots, statistical summaries, data mining techniques, trend and pattern detection, and hypothesis generation.
AB - During the COVID-19 pandemic of 2020, major case reporting outlets quickly coalesced around two or three primary vendors. Johns Hopkins University and The New York Times were among the more prominent, and all were of great value to the nation, particularly during the uncertain early stages of the pandemic. They primarily focused on three major attributes: numbers of new cases, deaths, and recoveries. Recognizing that many states were reporting very detailed data sets (e.g., hospital beds) at a county level or finer, the ORNL Pandemic Modeling team embarked on a major data curation effort from March to June 2020 for the purpose of capturing this wealth of detailed data. The challenge of curating this data was daunting. The number of attributes reported by the states grew on an almost weekly basis. States were routinely shifting their web tool strategies away from easily parsable HTML-based formatting to new Tableau and ArcGIS content. This growth in the sheer number of attributes, combined with the unpredictable shifts in data format, meant an aggressive and agile combination of automated scripting and manual scraping was required to capture new daily streams. Further, the team had to scale up staff and widen its approach for capture and storage. As a result, the team collected more than 11 million data points. Following the close of this data collection effort on June 30th, 2020, the team embarked on a major effort to appraise what had been collected, including an inventory list, spatial completeness, temporal completeness, scale and geographic characteristics, and a determination. A report on this matter was submitted on September 15th, 2020, titled “DOE COVID-19 Data Curation Effort: Overview of Data Collection Coverage”. Over 2000 unique attributes had been netted over a wide range of spatial scales, including state, county, zip codes, health regions, and census blocks.
Over 11 million individual data points were collected across these attributes, and spatial coverage (in total) included all 50 states and multiple territories. What became apparent in the process is that in the absence of any data standards, many states reported a wide variety of unique attributes that were not always compatible with attributes reported in other states. As time went on, states began adding new attributes and offering finer-grained detail in some older attributes. This meant that not all data streams existed for the entire time period; in fact, the number of streams tended to increase dramatically towards the end. Often, states would begin an attribute series and then stop altogether. These highly variable and uncertain conditions illuminated the need for harmonization approaches that would reconcile and conflate changing attribute names and detail over time. One example is grouping racial data reported as either Black or African American, depending on the state, into a single harmonized attribute. These choices would make a within-state analysis possible during the time period and lead to potential between-state analytics later on. This was almost entirely a manual decision process, requiring some subjective decision-making at times, to prevent a fragmented, short-lived collection of time series fragments that would offer few insights into trends, patterns, and correlates. This report imports harmonized data for state and county into the World Spatio-Temporal Analytics and Mapping Project (WSTAMP). WSTAMP is a major space-time analysis and visualization tool developed at ORNL for the National Geospatial-Intelligence Agency specifically for this kind of exploratory analysis. WSTAMP offers a rich analytical and graphical environment consisting of a wide range of analytics. These include time series plots, statistical summaries, data mining techniques, trend and pattern detection, and hypothesis generation.
KW - 59 BASIC BIOLOGICAL SCIENCES
KW - 97 MATHEMATICS AND COMPUTING
U2 - 10.2172/1840156
DO - 10.2172/1840156
M3 - Technical Report
CY - United States
ER -