Extraction of tumor site from cancer pathology reports using deep filters

Abhishek K. Dubey, Jacob Hinkle, J. Blair Christian, Georgia Tourassi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

2 Scopus citations

Abstract

Purpose: Pathology reports are the primary source of information on the millions of cancer cases across the United States. Cancer registries manually process these reports to extract the pertinent information, including primary tumor site, behavior, histology, laterality, and grade. Processing a large volume of pathology reports in a timely manner is a continuing challenge for cancer registries. The purpose of this study is to develop an information extraction pipeline that extracts reportable information reliably and efficiently.

Method: We have developed a novel information extraction pipeline based on inverse regression (IR). Supervised filters based on inverse regression have been applied successfully in many application domains. However, their application to information extraction from unstructured text is hindered primarily by the extremely high dimensionality of n-gram representations of text. In this study, we attempt to overcome this obstacle with a novel bootstrapping strategy. First, we use an information-theoretic filter based on mutual information to discard excessive and redundant n-gram features. This step reduces the size, and potentially improves the condition number, of the sample covariance matrix, thus reducing the computational cost and improving the numerical stability of the subsequent inverse-regression step. Then we use localized sliced inverse regression (LSIR) to learn a low-dimensional discriminatory subspace for information inference. In particular, we use the k-nearest neighbors of an unlabeled pathology report in the learned representation to infer the desired information from the labeled data in a supervised manner.

Results: The experiments were conducted on a set of de-identified pathology reports, with human expert labels as the ground truth. Our pipeline consistently performed better than, or comparably to, the best-performing state-of-the-art methods while substantially reducing training and inference times.

Conclusion: Our results demonstrate the potential of an inverse-regression-based pipeline for reliable and efficient information extraction from unstructured text. The information extracted from pathology reports can be used along with clinical information, medical imaging, and genomic information to drive discoveries in cancer research.
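
The three stages named in the Method paragraph (mutual-information filtering of n-gram features, a sliced inverse-regression projection, and k-nearest-neighbor inference) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain sliced inverse regression with one slice per class rather than the localized variant (LSIR), and the function names, the ridge term, the choice to keep only two features, and the toy report snippets are all assumptions made for illustration.

```python
import numpy as np

def ngram_counts(docs, vocab=None, n_max=2):
    """Count word n-grams (1..n_max); builds a vocabulary unless one is given."""
    grow = vocab is None
    vocab = {} if grow else vocab
    rows = []
    for doc in docs:
        toks = doc.lower().split()
        grams = [" ".join(toks[i:i + n]) for n in range(1, n_max + 1)
                 for i in range(len(toks) - n + 1)]
        if grow:
            for g in grams:
                vocab.setdefault(g, len(vocab))
        rows.append([vocab[g] for g in grams if g in vocab])  # skip unseen n-grams
    X = np.zeros((len(docs), len(vocab)))
    for r, idxs in enumerate(rows):
        for j in idxs:
            X[r, j] += 1
    return X, vocab

def mutual_information(X, y):
    """MI (in nats) between each feature's presence/absence and the class label."""
    present = X > 0
    mi = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = (y == c)
        p_c = in_c.mean()
        for value in (True, False):
            hit = (present == value)
            p_f = hit.mean(axis=0)                       # P(feature == value)
            p_fc = (hit & in_c[:, None]).mean(axis=0)    # P(feature == value, label == c)
            ok = p_fc > 0                                # 0 * log(0) terms contribute nothing
            mi[ok] += p_fc[ok] * np.log(p_fc[ok] / (p_f[ok] * p_c))
    return mi

def sir_directions(X, y, n_dirs, ridge=1e-3):
    """Plain SIR with one slice per class (assumption: not the localized variant)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / len(X) + ridge * np.eye(X.shape[1])  # ridge keeps cov invertible
    between = np.zeros_like(cov)
    for c in np.unique(y):
        d = X[y == c].mean(axis=0) - mu
        between += (y == c).mean() * np.outer(d, d)        # covariance of slice means
    evals, evecs = np.linalg.eigh(cov)
    W = evecs / np.sqrt(evals)                             # whitening: W.T @ cov @ W = I
    vals, vecs = np.linalg.eigh(W.T @ between @ W)
    top = np.argsort(vals)[::-1][:n_dirs]
    return mu, W @ vecs[:, top]

def knn_predict(Z_train, y_train, Z_new, k=3):
    """Majority vote among the k nearest labeled points in the projected space."""
    d = ((Z_new[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    preds = []
    for row in np.argsort(d, axis=1)[:, :k]:
        labels, counts = np.unique(y_train[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Toy snippets invented for this sketch; they are not data from the study.
train_docs = [
    "invasive ductal carcinoma of the left breast",
    "breast biopsy shows ductal carcinoma in situ",
    "right breast mass lobular carcinoma",
    "adenocarcinoma of the right upper lobe of lung",
    "lung biopsy shows squamous cell carcinoma",
    "left lung nodule small cell carcinoma",
]
y_train = np.array(["breast"] * 3 + ["lung"] * 3)

X_train, vocab = ngram_counts(train_docs)
keep = np.argsort(mutual_information(X_train, y_train))[::-1][:2]  # MI filter
mu, B = sir_directions(X_train[:, keep], y_train, n_dirs=1)
Z_train = (X_train[:, keep] - mu) @ B

new_docs = ["ductal carcinoma found in breast tissue",
            "nodule in the lung with adenocarcinoma"]
X_new, _ = ngram_counts(new_docs, vocab)
pred = knn_predict(Z_train, y_train, (X_new[:, keep] - mu) @ B, k=3)
print(pred)
```

Filtering before the projection shrinks the covariance matrix from one row per n-gram to one row per retained feature, which is the cost and conditioning benefit the abstract describes; a realistic setting would retain far more than two features.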

Original language: English
Title of host publication: ACM-BCB 2019 - Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Publisher: Association for Computing Machinery, Inc
Pages: 320-327
Number of pages: 8
ISBN (Electronic): 9781450366663
State: Published - Sep 4 2019
Event: 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2019 - Niagara Falls, United States
Duration: Sep 7 2019 to Sep 10 2019

Conference

Conference: 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2019
Country/Territory: United States
City: Niagara Falls
Period: 09/7/19 to 09/10/19

Funding

This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725.
