Abstract
We introduce a deep learning architecture, hierarchical self-attention networks (HiSANs), designed for classifying pathology reports and show how its unique architecture leads to a new state-of-the-art in accuracy, faster training, and clear interpretability. We evaluate performance on a corpus of 374,899 pathology reports obtained from the National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) program. Each pathology report is associated with five clinical classification tasks – site, laterality, behavior, histology, and grade. We compare the performance of the HiSAN against other machine learning and deep learning approaches commonly used on medical text data – Naive Bayes, logistic regression, convolutional neural networks, and hierarchical attention networks (the previous state-of-the-art). We show that HiSANs are superior to other machine learning and deep learning text classifiers in both accuracy and macro F-score across all five classification tasks. Compared to the previous state-of-the-art, hierarchical attention networks, HiSANs not only are an order of magnitude faster to train, but also achieve about 1% better relative accuracy and 5% better relative macro F-score.
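The abstract reports both accuracy and macro F-score, and the two can diverge sharply when classes are imbalanced. The sketch below is a minimal illustration of that difference and of how a "relative" improvement is typically computed. It is not the authors' evaluation code: the toy labels, the function names, and the choice to average over the union of true and predicted labels are all assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones.

    Assumption: classes are the union of labels seen in y_true and y_pred,
    and a class with no true or predicted instances contributes F1 = 0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_improvement(new, old):
    """Improvement of `new` over `old`, expressed as a fraction of `old`."""
    return (new - old) / old


# Hypothetical toy data: 8 reports of a common site, 2 of a rare one,
# and a classifier that always predicts the common site.
y_true = ["breast"] * 8 + ["lung"] * 2
y_pred = ["breast"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case the classifier is right on 8 of 10 reports yet never identifies the rare class, so its macro F-score is far lower than its accuracy. This is why macro F-score is a useful companion metric for tasks with many infrequent labels.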
Original language | English |
---|---|
Article number | 101726 |
Journal | Artificial Intelligence in Medicine |
Volume | 101 |
State | Published - Nov 2019 |
Funding
This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Funders | Funder number |
---|---|
National Institutes of Health | |
U.S. Department of Energy | |
National Cancer Institute | |
Office of Science | |
Argonne National Laboratory | DE-AC02-06-CH11357 |
Lawrence Livermore National Laboratory | DE-AC52-07NA27344 |
Oak Ridge National Laboratory | DE-AC05-00OR22725 |
Los Alamos National Laboratory | DE-AC52-06NA25396 |
Keywords
- Cancer pathology reports
- Clinical reports
- Deep learning
- Natural language processing
- Text classification