Characterizing sub-cohorts via data normalization and representation learning

Everett Rush, Ozgur Ozmen, Kathryn Knight, Byung Park, Clifton Baker, Makoto Jones, Merry Ward, Jonathan Nebeker

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Identifying a cohort of interest is challenging: it requires manually inspecting many patient records with complex structure, which may contain medical coding errors and missing data. This paper presents a computational pipeline for refining cohort selection based on medical concepts recorded in electronic health records (EHRs). The pipeline extracts EHR data for a given cohort and normalizes the data using standard vocabularies. A stacked denoising autoencoder then embeds the normalized patient vectors in a low-dimensional space, where the patients are clustered into sub-cohorts. The goal is to represent the cohort in a standard format and abstract variants of sub-populations. As a use case, we applied the pipeline to 1.8 million Veterans diagnosed with major depressive disorder (MDD) and identified four meaningful sub-cohorts using the features learned by the autoencoder. Each sub-cohort was then explored using a set of keywords for interpretation.
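The abstract gives no implementation details, so the following is only a minimal sketch of the three stages it names (normalization, denoising-autoencoder embedding, clustering), not the authors' code. It assumes NumPy is available. The code-to-concept map, the concept IDs, the toy records, and all hyperparameters are hypothetical, a single autoencoder layer stands in for the stacked model, and k-means is an assumed choice because the abstract does not name the clustering method.

```python
import numpy as np

# Hypothetical map from source codes to standard concepts.
# The concept IDs are illustrative placeholders, not real UMLS CUIs.
CODE_TO_CONCEPT = {
    "icd9:296.2": "C_MDD", "icd10:F32.9": "C_MDD",
    "icd9:309.81": "C_PTSD", "icd10:F43.10": "C_PTSD",
    "icd9:250.00": "C_DM", "icd10:E11.9": "C_DM",
    "icd9:401.9": "C_HTN", "icd10:I10": "C_HTN",
}
CONCEPTS = sorted(set(CODE_TO_CONCEPT.values()))


def normalize(records):
    """Turn each patient's raw codes into a multi-hot vector over concepts."""
    index = {c: i for i, c in enumerate(CONCEPTS)}
    x = np.zeros((len(records), len(CONCEPTS)))
    for row, codes in enumerate(records):
        for code in codes:
            concept = CODE_TO_CONCEPT.get(code)  # unmapped codes are skipped
            if concept is not None:
                x[row, index[concept]] = 1.0
    return x


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_dae(x, hidden=2, noise=0.3, epochs=500, lr=0.5, seed=0):
    """Train one denoising autoencoder layer and return its encoder."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w1 = rng.normal(0, 0.1, (d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0, 0.1, (hidden, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        # Masking noise: randomly zero inputs, then reconstruct the clean x.
        corrupted = x * (rng.random(x.shape) > noise)
        h = sigmoid(corrupted @ w1 + b1)
        out = sigmoid(h @ w2 + b2)
        d_out = (out - x) / n  # cross-entropy gradient at the output logits
        d_h = (d_out @ w2.T) * h * (1 - h)
        w2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum(0)
        w1 -= lr * (corrupted.T @ d_h)
        b1 -= lr * d_h.sum(0)
    return lambda v: sigmoid(v @ w1 + b1)


def kmeans(z, k, iters=50, seed=0):
    """Plain Lloyd's k-means on the embedded patients."""
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), size=k, replace=False)]
    for _ in range(iters):
        dists = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = z[labels == j].mean(0)
    return labels


# Toy cohort: the same conditions recorded under different coding systems.
records = [
    ["icd9:296.2", "icd9:309.81"],
    ["icd10:F32.9", "icd10:F43.10"],
    ["icd9:296.2", "icd10:E11.9", "icd10:I10"],
    ["icd10:F32.9", "icd9:250.00", "icd9:401.9", "local:unknown"],
]
x = normalize(records)      # patients x concepts
z = train_dae(x)(x)         # low-dimensional embedding
labels = kmeans(z, k=2)     # sub-cohort assignment per patient
```

In this toy cohort, normalization gives the first two patients identical vectors even though one record uses ICD-9 codes and the other ICD-10, which is the kind of variation a standard vocabulary is meant to remove before the embedding step.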

Original language: English
Title of host publication: Proceedings - 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems, CBMS 2020
Editors: Alba Garcia Seco de Herrera, Alejandro Rodriguez Gonzalez, KC Santosh, Zelalem Temesgen, Bridget Kane, Paolo Soda
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 173-176
Number of pages: 4
ISBN (Electronic): 9781728194295
State: Published - Jul 2020
Event: 33rd IEEE International Symposium on Computer-Based Medical Systems, CBMS 2020 - Virtual, Online, United States
Duration: Jul 28 2020 → Jul 30 2020

Publication series

Name: Proceedings - IEEE Symposium on Computer-Based Medical Systems
Volume: 2020-July
ISSN (Print): 1063-7125

Conference

Conference: 33rd IEEE International Symposium on Computer-Based Medical Systems, CBMS 2020
Country/Territory: United States
City: Virtual, Online
Period: 07/28/20 → 07/30/20

Funding

ACKNOWLEDGMENT This work is sponsored by the US Department of Veterans Affairs. This research used resources from the Knowledge Discovery Infrastructure at the Oak Ridge National Laboratory, which is supported by the DOE Office of Science under contract DE-AC05-00OR22725. *This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders                                Funder number
U.S. Department of Energy
U.S. Department of Veterans Affairs
Office of Science                      DE-AC05-00OR22725

    Keywords

    • Clustering
    • Cohort selection
    • Data normalization
    • Electronic health records
    • Representation learning
    • UMLS
