Abstract
Diverse observational and simulation datasets are needed to understand and predict complex ecosystem behavior over seasonal to decadal and century time-scales. Integration of these datasets poses a major barrier towards advancing environmental science, particularly due to differences in the structure and formats of data provided by various sources. Here, we describe BASIN-3D (Broker for Assimilation, Synthesis and Integration of eNvironmental Diverse, Distributed Datasets), a data integration framework designed to dynamically retrieve and transform heterogeneous data from different sources into a common format to provide an integrated view. BASIN-3D enables users to adopt a standardized approach for data retrieval and avoid customizations for the data type or source. We demonstrate the value of BASIN-3D with two use cases that require integration of data from regional to watershed spatial scales. The first application uses the BASIN-3D Python library to integrate time-series hydrological and meteorological data to provide standardized inputs to analytical and machine learning codes in order to predict the impacts of hydrological disturbances on large river corridors of the United States. The second application uses the BASIN-3D Django framework to integrate diverse time-series data in a mountainous watershed in East River, Colorado, United States to enable scientific researchers to explore and download data through an interactive web portal. Thus, BASIN-3D can be used to support data integration for both web-based tools and data analytics using Python scripting and extensions such as Jupyter notebooks. The framework is expected to be transferable to and useful for many other field and modeling studies.
Original language | English |
---|---|
Article number | 105024 |
Journal | Computers and Geosciences |
Volume | 159 |
DOIs | |
State | Published - Feb 2022 |
Externally published | Yes |
Funding
This research is supported as part of the Watershed Function Scientific Focus Area, the iNAIADS DOE Early Career Project, and the Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) funded by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under Award no. DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231. We acknowledge the support of the Watershed SFA team who provided feedback for the scientist-centered design exercises. We also acknowledge the anonymous reviewers whose comments helped improve the manuscript significantly.
Data federation, also known as the hub and spoke model (Haas et al., 2002), is an alternate approach that has gained traction. Here, data are left at the original sources and an intermediate brokering software maintains a catalog and retrieves data on demand (Genesereth, 2010; Nativi et al., 2013). This allows users to access the latest version of the data from different sources as though it were available in a central location. The brokering approach has been adopted by systems such as the Group on Earth Observations (GEOSS) Discovery and Access Broker (Nativi et al., 2014) and the related BCube brokering framework (Khalsa, 2017). The US National Groundwater Monitoring Network (https://cida.usgs.gov/ngwmn/index.jsp) uses an advanced brokering approach to synthesize datasets from various sources to support a portal with interactive visualizations; however, this system only handles specific types of groundwater monitoring data (water level, quality, lithology), and requires cooperative agreements with providers to standardize their data to facilitate data exchange (https://acwi.gov/sogw/ngwmn_framework_report_july2013.pdf).
One of the more successful implementations of a brokering approach is the Consortium of Universities for the Advancement of Hydrologic Science (CUAHSI) Hydrologic Information System (HIS; https://hiscentral.cuahsi.org), which transforms diverse time-series data using the WaterOneFlow web services into a standardized WaterML format with common variable and unit names from the CUAHSI controlled vocabularies (Horsburgh et al., 2009, 2016). The HIS enables unified access to data synthesized from over 95 providers via the Hydroclient interactive portal. However, the WaterOneFlow web services need to be hosted and maintained by the provider or CUAHSI, which limits its application to data sources that belong to the HIS ecosystem. The HIS system also does not support large data downloads as the Hydroclient limits search and access to 25,000 results. This tends to be problematic for intensive data-driven applications such as ML, where programmatic access to large amounts of data from sources outside of HIS may be needed.
BASIN-3D uses constructs from the OGC standard to represent multiscale spatial elements with their location features, associated groupings and hierarchies. In particular, BASIN-3D uses "Monitoring Feature" entities which inherit components of the OGC entities "Feature", "Sampling Feature", "Spatial Sampling Feature" (Appendix 1) (Cox, 2011; Tomkins and Lowe, 2016). Monitoring Features are classified by a controlled list of Feature Types that represent spatial features at different scales relevant to watershed sciences: "Region", "Subregion", "Basin", "Subbasin", "Watershed", "Site", "Plot", "Horizontal Path", "Vertical Path" and "Point" (e.g. Fig. 2); this list is specific to each BASIN-3D implementation and can be expanded to include additional Feature Types relevant to other disciplines.
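The Feature Type classification and the hierarchies mentioned above can be illustrated with a minimal sketch. It is not the BASIN-3D data model: the class names `FeatureType` and `MonitoringFeature`, the `parent` field, the `lineage` helper, and the identifiers `east-river`, `site-a` and `sensor-1` are all hypothetical and chosen only for illustration. Only the Feature Type names come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FeatureType(Enum):
    """Controlled list of Feature Types named in the text."""
    REGION = "Region"
    SUBREGION = "Subregion"
    BASIN = "Basin"
    SUBBASIN = "Subbasin"
    WATERSHED = "Watershed"
    SITE = "Site"
    PLOT = "Plot"
    HORIZONTAL_PATH = "Horizontal Path"
    VERTICAL_PATH = "Vertical Path"
    POINT = "Point"


@dataclass
class MonitoringFeature:
    """Hypothetical spatial entity with an optional parent feature."""
    id: str
    feature_type: FeatureType
    parent: Optional[MonitoringFeature] = None

    def lineage(self) -> list[str]:
        """Return labels from the outermost ancestor down to this feature."""
        labels = []
        node: Optional[MonitoringFeature] = self
        while node is not None:
            labels.append(f"{node.feature_type.value}:{node.id}")
            node = node.parent
        return list(reversed(labels))


# Illustrative hierarchy; the identifiers are made up.
watershed = MonitoringFeature("east-river", FeatureType.WATERSHED)
site = MonitoringFeature("site-a", FeatureType.SITE, parent=watershed)
point = MonitoringFeature("sensor-1", FeatureType.POINT, parent=site)
print(" > ".join(point.lineage()))
```

Keeping the Feature Types in one enumeration mirrors the idea of a controlled list: an implementation for another discipline would extend the enumeration rather than change the feature class.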
Monitoring Features, as an extension of Spatial Sampling Features, are geographic entities that have a shape property to describe their spatial geometry as one of four types specified by OGC: "point" (e.g., point, specimen), "curve" (e.g., river, well, tower), "surface" (e.g., river basin, watershed, site, plot), and "solid" (e.g. lidar cloud). The physical location coordinates of a Monitoring Feature are represented using the Federal Geographic Data Committee (FGDC) data standard (FGDC 1998), which provides support for multiple spatial reference systems including geographic (latitude/longitude), grid (Universal Transverse Mercator), and planar (distance/bearing representation) coordinates. Monitoring Features can be infinitely nested using parent-child spatial relationships. For example, a plot containing multi-level wells will have three types of Monitoring Features defined as a surface (plot) > curve (well) > point (sensors at different depths in the well; e.g. Fig. 2).
BASIN-3D also supports diverse observation types with the OGC concepts. It uses a generically-defined "Observation" entity with three components: the "Observed Property" describes the measurement (e.g. river discharge, stream chemical concentrations), the "Feature of interest" defines the subject of observation (e.g. river), and the "Observation Results" defines the results of the observation using abstracted data structures (e.g., a time-series of coupled timestamps and values). The components of the Observation are linked as follows: Observation Results of an Observed Property are reported for a Feature of interest, typically specified as a representative Monitoring Feature (Section 2.1).
The Data Acquisition Layer provides functionality to customize data source connections as required for the application using a plugin architecture containing extensible Python classes.
The base Plugin classes enable connection to any network-accessible source such as a database, web service or a remote or local filesystem. They also include an extensible Hypertext Transfer Protocol (HTTP) connection module with support for some common authentication methods, such as the HTTP authentication API that supports OAuth2 (https://oauth.net/2/) and token-based authentication. Custom data access plugins consist of two components: (1) a Python module and (2) a CSV file with a mapping of data source variables to BASIN-3D variables. In the plugin Python module, the developer extends the base plugin classes and implements the authentication required by the source (if any), constructs queries to retrieve data and metadata required by BASIN-3D (Table 2), and maps the structure, format and semantics of the returned data to the Synthesis models. In particular, information about measurement locations (via a mapping to Monitoring Feature objects) and time-series data (via a mapping to Measurement TVP Timeseries objects) is configured in the Python plugin module. Mapping to the BASIN-3D synthesis models is open-ended and accommodates a range of scenarios depending on the availability of data or metadata from the data source. Plugin developers can choose to return mapped data from the data source, return data from other supplementary local or remote sources, or return nothing if no relevant information is provided by the data source. The WFSFA implementation (Section 3.2) describes additional examples of plugin configuration. Plugins can be shared between the Python library and Django implementations. Currently both versions are bundled with a plugin to the public USGS National Water Information System (NWIS; https://nwis.waterdata.usgs.gov) that can be used out-of-the-box to access the NWIS data and also used as a template to create new plugins to connect to new sources.
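The two plugin components described above, a Python module that extends a base class and a CSV file that maps source variables to common variables, can be sketched as follows. This is a self-contained illustration under stated assumptions and does not use the BASIN-3D API: `BasePlugin`, `ExampleSourcePlugin`, `load_variable_mapping`, the CSV column names, the synthesis variable names and the record fields are all hypothetical. The source codes `00060` and `00010` are the USGS parameter codes for discharge and water temperature; `99999` is a made-up unmapped code.

```python
import csv
import io
from typing import Dict, Iterable, List

# Hypothetical contents of a plugin's CSV mapping file. The column names and
# synthesis variable names are illustrative, not BASIN-3D's actual schema.
MAPPING_CSV = """source_variable,synthesis_variable
00060,river_discharge
00010,water_temperature
"""


def load_variable_mapping(text: str) -> Dict[str, str]:
    """Read source-variable to synthesis-variable pairs from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["source_variable"]: row["synthesis_variable"] for row in reader}


class BasePlugin:
    """Stand-in for a base plugin class (hypothetical, not the BASIN-3D class)."""

    def __init__(self, mapping: Dict[str, str]) -> None:
        self.mapping = mapping

    def fetch(self) -> Iterable[dict]:
        """Subclasses query their data source here."""
        raise NotImplementedError

    def synthesize(self) -> List[dict]:
        """Translate source records into a common format, skipping unmapped variables."""
        records = []
        for raw in self.fetch():
            variable = self.mapping.get(raw["variable"])
            if variable is None:
                continue  # the source reports a variable the mapping does not cover
            records.append(
                {
                    "feature": raw["site"],
                    "observed_property": variable,
                    "timestamp": raw["time"],
                    "value": raw["value"],
                }
            )
        return records


class ExampleSourcePlugin(BasePlugin):
    """Hypothetical plugin that returns canned records instead of calling a service."""

    def fetch(self) -> Iterable[dict]:
        return [
            {"site": "site-1", "variable": "00060", "time": "2020-01-01T00:00", "value": 12.5},
            {"site": "site-1", "variable": "99999", "time": "2020-01-01T00:00", "value": 1.0},
        ]


plugin = ExampleSourcePlugin(load_variable_mapping(MAPPING_CSV))
synthesized = plugin.synthesize()
print(synthesized)
```

Keeping the variable mapping in a CSV file separates vocabulary decisions from connection code, so a new source variable can be mapped by editing the file rather than the module.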
After the plugins are created, it is trivial to query the BASIN-3D APIs for integrated search and access of data across all configured sources. Any custom data access plugin that extends the BASIN-3D Plugin classes can be registered for use with the Data Synthesis Layer using a simple function call.
The project required integration of these diverse data held in different sources to minimize redundant and inconsistent efforts by scientists to retrieve and synthesize data. A critical need was for software to integrate data from the two SFA private databases that required authentication with data from public sources such as the USGS NWIS and EPA (Fig. 5). Hence, the web version of BASIN-3D was used to support integration across the SFA's East River and Rifle field sites and USGS sites across the East-Taylor Watershed, and to support serving the data through a user-friendly interactive web portal.
BASIN-3D has been designed to harmonize, integrate and query diverse datasets that result from a range of field investigations, monitoring networks and model simulations. In particular, use of the OGC and FGDC standards provides a means to support flexible synthesis of diverse measurement configurations and data types using abstracted data structures (Section 4.1). These standards are a suitable choice as they have been developed over several years to enable interoperability across data systems and have achieved consensus across and adoption by various organizations.
We encountered some challenges implementing these data standards as the underlying construct for integration. First, it was not easy to use the standards, partly because they are specified at a high level and do not provide implementation guidance beyond some simple, limited examples. For example, the OGC standards do not specify implementation of Monitoring Features or their parent features for different shapes or resolve how collections of spatial hierarchies should be organized.
Standards also use specialized terminologies that domain scientists may not be familiar with. For BASIN-3D, we had to balance constraints of following the OGC standard using the specified terminology (e.g., using observed_property and measurement_tvp_timeseries), while making the concepts and Synthesis API calls logical to domain scientists. Thus, in a few cases we deviated from OGC definitions or terminologies to improve the usability of BASIN-3D for scientific researchers or for other practical reasons. For example, while the OGC standards differentiate between the feature being observed and the representative sampling feature upon which the actual observation is made, BASIN-3D does not make this distinction because most data sources only include information on the Sampling Feature and a "Feature" in one case may be a "Sampling Feature" in another. Thus all spatial entities are Monitoring Features in BASIN-3D; however, the data model is implemented as hierarchical classes which enables expansion to support any OGC Feature entity. Similarly, all Feature entities use Feature Types instead of specific, entity-based types (e.g., Spatial Sampling Feature Type) for practical implementation.
Keywords
- Data integration
- Environmental data
- Multiscale diverse data
- Synthesis