Abstract
Deep learning has surged in popularity and proven effective for a variety of artificial intelligence applications, including information extraction from cancer pathology reports. Because word representations are the core units that let deep learning algorithms interpret words and perform natural language processing (NLP), they must encode as much information as possible to help these algorithms achieve high classification performance. Therefore, in this work, in addition to the distributional information of words in large corpora, we use UMLS vocabulary resources to enrich the vector space representation of words with the semantic relations between them. These resources provide many terminologies pertaining to cancer. The refined word embeddings are used with a convolutional neural network (CNN) model to extract four data elements from cancer pathology reports: ICD-O-3 tumor topography codes, tumor laterality, behavior, and histological grade. We observed that CNN models whose word embeddings were enriched with UMLS vocabulary resources consistently outperformed, across all four tasks, both CNN models without pre-trained word embeddings and CNN models with word embeddings pre-trained on a domain-specific corpus. The results show a marginal improvement on the laterality task but a significant improvement on the other tasks, especially in macro-F score. Specifically, the improvements are 3%, 13%, and 15% for the tumor site, histological grade, and behavior tasks, respectively. These results encourage enriching word embeddings with additional clinical data resources for information abstraction tasks on clinical pathology reports.
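The abstract does not say how the UMLS relations are folded into the word vectors. As a rough illustration only, the sketch below uses a retrofitting-style update, one common way to pull the vectors of lexicon-related words closer together while keeping each word near its original distributional vector. The function names, the toy two-dimensional vectors, and the tiny synonym graph are all hypothetical and are not taken from UMLS or from the paper.

```python
# Hypothetical illustration: retrofitting-style refinement of word vectors.
# The toy vectors and the synonym graph are invented for this sketch;
# they do not come from UMLS or from the paper's experiments.

def retrofit(vectors, neighbors, iterations=10, alpha=1.0, beta=1.0):
    """Pull each word vector toward the average of its lexicon neighbors.

    vectors:   dict word -> list of floats (original distributional vectors)
    neighbors: dict word -> list of related words (e.g. lexicon synonyms)
    alpha:     weight that keeps a word close to its original vector
    beta:      weight given to each lexicon neighbor
    """
    refined = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, related in neighbors.items():
            # Ignore neighbors that have no vector of their own.
            related = [r for r in related if r in refined]
            if word not in refined or not related:
                continue
            total = alpha + beta * len(related)
            refined[word] = [
                (alpha * vectors[word][i]
                 + beta * sum(refined[r][i] for r in related)) / total
                for i in range(len(refined[word]))
            ]
    return refined


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


toy_vectors = {
    "carcinoma": [1.0, 0.0],
    "cancer": [0.0, 1.0],
    "left": [-1.0, 0.5],
}
# Assumed semantic relation: the two terms are linked in the lexicon.
toy_lexicon = {"carcinoma": ["cancer"], "cancer": ["carcinoma"]}

refined = retrofit(toy_vectors, toy_lexicon)
before = cosine(toy_vectors["carcinoma"], toy_vectors["cancer"])
after = cosine(refined["carcinoma"], refined["cancer"])
print(f"cosine before: {before:.3f}, after: {after:.3f}")
```

In this toy setup the two related terms start out orthogonal and end up closer together, while "left", which has no lexicon neighbors, keeps its original vector. Refined vectors of this kind could then initialize the embedding layer of a classifier such as a CNN.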
Original language | English |
---|---|
Title of host publication | Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018 |
Editors | Naoki Abe, Huan Liu, Calton Pu, Xiaohua Hu, Nesreen Ahmed, Mu Qiao, Yang Song, Donald Kossmann, Bing Liu, Kisung Lee, Jiliang Tang, Jingrui He, Jeffrey Saltz |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 2838-2846 |
Number of pages | 9 |
ISBN (Electronic) | 9781538650356 |
State | Published - Jul 2 2018 |
Event | 2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States. Duration: Dec 10 2018 → Dec 13 2018 |
Publication series
Name | Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018 |
---|
Conference
Conference | 2018 IEEE International Conference on Big Data, Big Data 2018 |
---|---|
Country/Territory | United States |
City | Seattle |
Period | 12/10/18 → 12/13/18 |
Bibliographical note
Publisher Copyright: © 2018 IEEE.
Funding
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of the manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Funders | Funder number |
---|---|
National Institutes of Health | |
U.S. Department of Energy | |
National Cancer Institute | |
Office of Science | |
Keywords
- UMLS
- Word embeddings
- Convolutional neural networks
- Natural language processing