Abstract
Pathology reports are a primary source of data for cancer surveillance programs. Manual coding of pathology reports is labor-intensive but necessary for obtaining the labeled data used to train automated information extraction systems. In this study, we investigated whether semi-supervised deep learning can improve the performance of a multitask information extraction system for automated annotation of pathology reports. We used a set of over 374,000 pathology reports from the Louisiana Tumor Registry and a novel convolutional attention-based auto-encoder. We performed a set of experiments comparing supervised training augmented with unlabeled data at 1%, 5%, 10%, and 50% of the original data size. We also compared the impact of extending text processing to include unlabeled tokens. We found that semi-supervised training consistently improved individual task performance, increasing micro-averaged F-scores by between 0.012 and 0.064 and macro-averaged F-scores by up to 0.158. This demonstrates that semantic information learned via unsupervised learning can be used to improve supervised clinical task performance.
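The abstract reports gains in both micro- and macro-averaged F-scores, and the two can move by different amounts because they weight classes differently. The following is a minimal sketch of the standard definitions, not the authors' evaluation code; the function names and the toy labels are hypothetical and do not come from the paper's data.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no counts at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0
    # Micro: pool counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data (hypothetical): a classifier that never predicts the rare class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case the micro-averaged score stays high because the frequent class dominates the pooled counts, while the macro-averaged score is pulled down by the missed rare class. Improvements on rare classes therefore tend to show up more strongly in the macro average.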
Original language | English |
---|---|
Title of host publication | 2019 IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2019 - Proceedings |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
ISBN (Electronic) | 9781728108483 |
State | Published - May 2019 |
Event | 2019 IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2019 - Chicago, United States; Duration: May 19 2019 → May 22 2019 |
Publication series
Name | 2019 IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2019 - Proceedings |
---|---|
Conference
Conference | 2019 IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2019 |
---|---|
Country/Territory | United States |
City | Chicago |
Period | 05/19/19 → 05/22/19 |
Funding
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of the manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- Autoencoder
- Convolutional neural network
- Natural language processing
- Semi-supervised learning