TY - GEN
T1 - Real-Time Discovery Services over Large, Heterogeneous and Complex Healthcare Datasets Using Schema-Less, Column-Oriented Methods
AU - Begoli, Edmon
AU - Dunning, Ted
AU - Frasure, Charlie
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/5/19
Y1 - 2016/5/19
N2 - We present a service platform for schema-less exploration of data and discovery of patient-related statistics from healthcare data sets. The architecture of this platform is motivated by the need for fast, schema-less, and flexible approaches to SQL-based exploration and discovery of information embedded in the common, heterogeneously structured healthcare data sets and supporting components (electronic health records, practice management systems, etc.). The motivating use cases described in the paper are clinical trials candidate discovery and a treatment effectiveness analysis. Following the use cases, we discuss the key features and software architecture of the platform, the underlying core components (Apache Parquet, Drill, the web services server), and the runtime profiles and performance characteristics of the platform. We conclude by showing dramatic speedup with some approaches, and the performance tradeoffs and limitations of others.
AB - We present a service platform for schema-less exploration of data and discovery of patient-related statistics from healthcare data sets. The architecture of this platform is motivated by the need for fast, schema-less, and flexible approaches to SQL-based exploration and discovery of information embedded in the common, heterogeneously structured healthcare data sets and supporting components (electronic health records, practice management systems, etc.). The motivating use cases described in the paper are clinical trials candidate discovery and a treatment effectiveness analysis. Following the use cases, we discuss the key features and software architecture of the platform, the underlying core components (Apache Parquet, Drill, the web services server), and the runtime profiles and performance characteristics of the platform. We conclude by showing dramatic speedup with some approaches, and the performance tradeoffs and limitations of others.
KW - Apache Drill
KW - Apache Parquet
KW - Column Oriented Stores
KW - Data Analysis
KW - Data Cyclone
KW - Healthcare
KW - Schema-less data management
KW - Services Platform
UR - http://www.scopus.com/inward/record.url?scp=84973649629&partnerID=8YFLogxK
U2 - 10.1109/BigDataService.2016.29
DO - 10.1109/BigDataService.2016.29
M3 - Conference contribution
AN - SCOPUS:84973649629
T3 - Proceedings - 2016 IEEE 2nd International Conference on Big Data Computing Service and Applications, BigDataService 2016
SP - 257
EP - 264
BT - Proceedings - 2016 IEEE 2nd International Conference on Big Data Computing Service and Applications, BigDataService 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2nd IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2016
Y2 - 29 March 2016 through 1 April 2016
ER -