TY - GEN
T1 - Characterizing sub-cohorts via data normalization and representation learning
AU - Rush, Everett
AU - Ozmen, Ozgur
AU - Knight, Kathryn
AU - Park, Byung
AU - Baker, Clifton
AU - Jones, Makoto
AU - Ward, Merry
AU - Nebeker, Jonathan
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/7
Y1 - 2020/7
AB - The process of identifying a cohort of interest is a very challenging task. It requires manually inspecting many patient records of complex structure that might include medical coding errors and missing data. This paper presents a computational pipeline for refining the process of cohort selection based on medical concepts recorded in the electronic health records (EHRs). The pipeline extracts EHR data for a given cohort and normalizes this data using standard vocabularies. Then a stacked denoising autoencoder is used to embed the normalized patient vectors in a low-dimensional space, where the patients are subsequently clustered into sub-cohorts. The goal is to represent the cohort in a standard format and abstract variants of sub-populations. As a use case, we applied the pipeline to 1.8 million Veterans diagnosed with major depressive disorder (MDD), and identified four meaningful sub-cohorts using the features learned by the autoencoder. Then, each sub-cohort was explored using a set of keywords for interpretation.
KW - Clustering
KW - Cohort selection
KW - Data normalization
KW - Electronic health records
KW - Representation learning
KW - UMLS
UR - http://www.scopus.com/inward/record.url?scp=85091133752&partnerID=8YFLogxK
U2 - 10.1109/CBMS49503.2020.00040
DO - 10.1109/CBMS49503.2020.00040
M3 - Conference contribution
AN - SCOPUS:85091133752
T3 - Proceedings - IEEE Symposium on Computer-Based Medical Systems
SP - 173
EP - 176
BT - Proceedings - 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems, CBMS 2020
A2 - de Herrera, Alba Garcia Seco
A2 - Rodriguez Gonzalez, Alejandro
A2 - Santosh, KC
A2 - Temesgen, Zelalem
A2 - Kane, Bridget
A2 - Soda, Paolo
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 33rd IEEE International Symposium on Computer-Based Medical Systems, CBMS 2020
Y2 - 28 July 2020 through 30 July 2020
ER -