Limitations of Transformers on Clinical Text Classification

Shang Gao, Mohammed Alawad, M. Todd Young, John Gounley, Noah Schaefferkoetter, Hong Jun Yoon, Xiao Cheng Wu, Eric B. Durbin, Jennifer Doherty, Antoinette Stroup, Linda Coyle, Georgia Tourassi

Research output: Contribution to journal › Article › peer-review

95 Scopus citations

Abstract

Bidirectional Encoder Representations from Transformers (BERT) and BERT-based approaches are the current state-of-the-art in many natural language processing (NLP) tasks; however, their application to document classification on long clinical texts is limited. In this work, we introduce four methods to scale BERT, which by default can only handle input sequences up to approximately 400 words long, to perform document classification on clinical texts several thousand words long. We compare these methods against two much simpler architectures - a word-level convolutional neural network and a hierarchical self-attention network - and show that BERT often cannot beat these simpler baselines when classifying MIMIC-III discharge summaries and SEER cancer pathology reports. In our analysis, we show that two key components of BERT - pretraining and WordPiece tokenization - may actually be inhibiting BERT's performance on clinical text classification tasks where the input document is several thousand words long and where correctly identifying labels may depend more on identifying a few key words or phrases rather than understanding the contextual meaning of sequences of text.

Original language: English
Article number: 9364676
Pages (from-to): 3596-3607
Number of pages: 12
Journal: IEEE Journal of Biomedical and Health Informatics
Volume: 25
Issue number: 9
DOIs
State: Published - Sep 2021

Funding

Manuscript received September 14, 2020; revised February 5, 2021; accepted February 22, 2021. Date of publication February 26, 2021; date of current version September 3, 2021. This work was supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health; in part under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract No. DE-AC02-06CH11357, by Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344, by Los Alamos National Laboratory under Contract No. DE-AC52-06NA25396, and by Oak Ridge National Laboratory under Contract No. DE-AC05-00OR22725. This work used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This work was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. (Corresponding authors: Shang Gao; Georgia Tourassi.) Shang Gao, Mohammed Alawad, M. Todd Young, John Gounley, Noah Schaefferkoetter, Hong Jun Yoon, and Georgia Tourassi are with Oak Ridge National Laboratory, Oak Ridge, TN 37830 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Funders and funder numbers:
U.S. Department of Energy Office of Science and National Nuclear Security Administration
National Institutes of Health
U.S. Department of Energy
National Cancer Institute: P30CA177558
Office of Science: 17-SC-20-SC
Argonne National Laboratory: DE-AC02-06CH11357
Lawrence Livermore National Laboratory: DE-AC52-07NA27344
Oak Ridge National Laboratory: DE-AC05-00OR22725
Los Alamos National Laboratory: DE-AC52-06NA25396

Keywords

• BERT
• clinical text
• deep learning
• natural language processing
• neural networks
• text classification
