Abstract
Identifying a cohort of interest is a challenging task. It requires manually inspecting many patient records with complex structure, which may contain medical coding errors and missing data. This paper presents a computational pipeline for refining cohort selection based on medical concepts recorded in electronic health records (EHRs). The pipeline extracts EHR data for a given cohort and normalizes the data using standard vocabularies. A stacked denoising autoencoder then embeds the normalized patient vectors in a low-dimensional space, where the patients are subsequently clustered into sub-cohorts. The goal is to represent the cohort in a standard format and abstract variants of sub-populations. As a use case, we applied the pipeline to 1.8 million Veterans diagnosed with major depressive disorder (MDD) and identified four meaningful sub-cohorts using the features learned by the autoencoder. Each sub-cohort was then explored using a set of keywords for interpretation.
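The embed-then-cluster step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state the network architecture, noise type, or clustering algorithm, so the layer sizes, masking noise, greedy layer-wise training, k-means clustering, and the synthetic patient-by-concept matrix below are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae_layer(X, n_hidden, rng, noise=0.2, lr=0.5, epochs=200):
    """Train one denoising autoencoder layer and return its encoder (W, b)."""
    n, d = X.shape
    W = rng.normal(0, 0.1, (d, n_hidden)); b = np.zeros(n_hidden)
    V = rng.normal(0, 0.1, (n_hidden, d)); c = np.zeros(d)
    for _ in range(epochs):
        # Masking noise (assumed): randomly zero out a fraction of the inputs.
        Xn = X * (rng.random(X.shape) > noise)
        H = sigmoid(Xn @ W + b)          # encode the corrupted input
        R = sigmoid(H @ V + c)           # reconstruct the clean input
        # Cross-entropy loss with sigmoid output: output-layer gradient is R - X.
        dR = (R - X) / n
        dH = (dR @ V.T) * H * (1 - H)
        V -= lr * H.T @ dR; c -= lr * dR.sum(0)
        W -= lr * Xn.T @ dH; b -= lr * dH.sum(0)
    return W, b

def kmeans(Z, k, rng, iters=50):
    """Plain k-means (an assumed stand-in for the paper's clustering step)."""
    centers = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):
        dist = ((Z[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(0)
    return labels

# Synthetic stand-in for normalized patient vectors: 200 patients, 40 binary
# concept indicators, 4 planted groups (all values hypothetical).
rng = np.random.default_rng(0)
n_patients, n_concepts, k = 200, 40, 4
group = rng.integers(0, k, n_patients)
prob = np.full((n_patients, n_concepts), 0.05)
for g in range(k):
    prob[group == g, g * 10:(g + 1) * 10] = 0.6
X = (rng.random(prob.shape) < prob).astype(float)

# Stack two denoising layers greedily, then cluster the final embedding.
W1, b1 = train_dae_layer(X, 16, rng)
H1 = sigmoid(X @ W1 + b1)
W2, b2 = train_dae_layer(H1, 4, rng)
Z = sigmoid(H1 @ W2 + b2)               # low-dimensional patient embedding
labels = kmeans(Z, k, rng)              # sub-cohort assignment per patient
```

Here `Z` holds one low-dimensional vector per patient and `labels` assigns each patient to one of `k` candidate sub-cohorts. In practice the number of clusters and the embedding size would need to be chosen and validated against the data.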
Original language | English |
---|---|
Title of host publication | Proceedings - 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems, CBMS 2020 |
Editors | Alba Garcia Seco de Herrera, Alejandro Rodriguez Gonzalez, KC Santosh, Zelalem Temesgen, Bridget Kane, Paolo Soda |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 173-176 |
Number of pages | 4 |
ISBN (Electronic) | 9781728194295 |
State | Published - Jul 2020 |
Event | 33rd IEEE International Symposium on Computer-Based Medical Systems, CBMS 2020 - Virtual, Online, United States; Duration: Jul 28 2020 → Jul 30 2020 |
Publication series
Name | Proceedings - IEEE Symposium on Computer-Based Medical Systems |
---|---|
Volume | 2020-July |
ISSN (Print) | 1063-7125 |
Conference
Conference | 33rd IEEE International Symposium on Computer-Based Medical Systems, CBMS 2020 |
---|---|
Country/Territory | United States |
City | Virtual, Online |
Period | 07/28/20 → 07/30/20 |
Funding
This work is sponsored by the US Department of Veterans Affairs. This research used resources from the Knowledge Discovery Infrastructure at the Oak Ridge National Laboratory, which is supported by the DOE Office of Science under contract DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- Clustering
- Cohort selection
- Data normalization
- Electronic health records
- Representation learning
- UMLS