Classifying cancer pathology reports with hierarchical self-attention networks

Shang Gao, John X. Qiu, Mohammed Alawad, Jacob D. Hinkle, Noah Schaefferkoetter, Hong Jun Yoon, Blair Christian, Paul A. Fearn, Lynne Penberthy, Xiao Cheng Wu, Linda Coyle, Georgia Tourassi, Arvind Ramanathan

Research output: Contribution to journal > Article > peer-review

38 Scopus citations

Abstract

We introduce a deep learning architecture, hierarchical self-attention networks (HiSANs), designed for classifying pathology reports and show how its unique architecture leads to a new state-of-the-art in accuracy, faster training, and clear interpretability. We evaluate performance on a corpus of 374,899 pathology reports obtained from the National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) program. Each pathology report is associated with five clinical classification tasks – site, laterality, behavior, histology, and grade. We compare the performance of the HiSAN against other machine learning and deep learning approaches commonly used on medical text data – Naive Bayes, logistic regression, convolutional neural networks, and hierarchical attention networks (the previous state-of-the-art). We show that HiSANs are superior to other machine learning and deep learning text classifiers in both accuracy and macro F-score across all five classification tasks. Compared to the previous state-of-the-art, hierarchical attention networks, HiSANs not only are an order of magnitude faster to train, but also achieve about 1% better relative accuracy and 5% better relative macro F-score.
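The abstract reports both accuracy and macro F-score, with gains stated in relative terms. The sketch below is a minimal plain-Python illustration of how these quantities are commonly computed, not the authors' evaluation code. The function names, the toy laterality labels, and the example scores are all hypothetical and chosen only to show why macro F-score can diverge sharply from accuracy when classes are imbalanced.

```python
def accuracy(y_true, y_pred):
    """Fraction of reports whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_improvement(new_score, old_score):
    """Relative gain of new_score over old_score, e.g. 0.01 means 1% better."""
    return (new_score - old_score) / old_score


# Toy, hypothetical labels: an imbalanced task where the classifier
# always predicts the majority class.
y_true = ["left"] * 8 + ["right"] * 2
y_pred = ["left"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
print(f"relative gain = {relative_improvement(0.505, 0.500):.3f}")
```

In this toy case the classifier never predicts the minority class, so its per-class F1 for that class is zero and the macro average falls well below the accuracy. This is one reason macro F-score is often reported alongside accuracy for tasks with many rare classes.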

Original language: English
Article number: 101726
Journal: Artificial Intelligence in Medicine
Volume: 101
State: Published - Nov 2019

Funding

This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Funders (with funder number where listed)

National Institutes of Health
U.S. Department of Energy
National Cancer Institute
Office of Science
Argonne National Laboratory: DE-AC02-06-CH11357
Lawrence Livermore National Laboratory: DE-AC52-07NA27344
Oak Ridge National Laboratory: DE-AC05-00OR22725
Los Alamos National Laboratory: DE-AC52-06NA25396

Keywords

• Cancer pathology reports
• Clinical reports
• Deep learning
• Natural language processing
• Text classification
