Path-BigBird: An AI-Driven Transformer Approach to Classification of Cancer Pathology Reports

Research output: Contribution to journal > Article > peer-review

12 Scopus citations

Abstract

PURPOSE Surgical pathology reports are critical for cancer diagnosis and management. To accurately extract information about tumor characteristics from pathology reports in near real time, we explore the impact of using domain-specific transformer models that understand cancer pathology reports.

METHODS We built a pathology transformer model, Path-BigBird, by using 2.7 million pathology reports from six SEER cancer registries. We then compare different variations of Path-BigBird with two less computationally intensive methods: Hierarchical Self-Attention Network (HiSAN) classification model and an off-the-shelf clinical transformer model (Clinical BigBird). We use five pathology information extraction tasks for evaluation: site, subsite, laterality, histology, and behavior. Model performance is evaluated by using macro and micro F1 scores.

RESULTS We found that Path-BigBird and Clinical BigBird outperformed the HiSAN in all tasks. Clinical BigBird performed better on the site and laterality tasks. Versions of the Path-BigBird model performed best on the two most difficult tasks: subsite (micro F1 score of 72.53, macro F1 score of 35.76) and histology (micro F1 score of 80.96, macro F1 score of 37.94). The largest performance gains over the HiSAN model were for histology, for which a Path-BigBird model increased the micro F1 score by 1.44 points and the macro F1 score by 3.55 points. Overall, the results suggest that a Path-BigBird model with a vocabulary derived from well-curated and deidentified data is the best-performing model.

CONCLUSION The Path-BigBird pathology transformer model improves automated information extraction from pathology reports. Although Path-BigBird outperforms Clinical BigBird and HiSAN, these less computationally expensive models still have utility when resources are constrained.
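The gap between the micro and macro F1 scores reported in the abstract is typical of tasks with many rare classes. The following is a minimal sketch of how the two averages are commonly computed for single-label classification. It is not the authors' evaluation code; the function name `f1_scores` and the toy labels are assumptions made for illustration, and the toy scores are on a 0 to 1 scale.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted class p, but it was wrong
            fn[t] += 1          # missed the true class t

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: one frequent class and two rare ones.
y_true = ["breast"] * 8 + ["lung"] + ["colon"]
y_pred = ["breast"] * 10  # a model that always predicts the frequent class

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the always-majority predictor scores well on micro F1 and poorly on macro F1, because macro averaging weights each rare class as heavily as the frequent one.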

Original language: English
Article number: e2300148
Journal: JCO Clinical Cancer Informatics
Volume: 8
State: Published - 2024

Funding

Supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the US Department of Energy (DOE) and the NCI of the National Institutes of Health. This work was performed under the auspices of the DOE by the Argonne National Laboratory under Contract DE-AC02-06CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. The research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy Office of Science and the National Nuclear Security Administration. The authors acknowledge contributions to this study from staff members in the participating central cancer registries listed below. These registries are supported by the National Cancer Institute's SEER Program, the Centers for Disease Control and Prevention's National Program of Cancer Registries (NPCR), and/or state agencies, universities, and cancer centers. Kentucky Cancer Registry working under contract numbers SEER: HHSN261201800013I/HHSN26100001 and NPCR: NU58DP003907. Louisiana Tumor Registry working under contract numbers SEER: HHSN261201800007I/HHSN26100002 and NPCR: NU58DP0063. New Jersey State Cancer Registry working under contract numbers SEER: 75N91021D0000/75N91021F00001 and NPCR: NU58DP006279. New Mexico Tumor Registry working under contract numbers SEER: HHSN261601800014l. Fred Hutchinson Cancer Research Center working under contract numbers SEER: HHSN2612018000041. Utah Cancer Registry working under contract numbers SEER: HHSN261201800016I and NPCR: NU58DP006320. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC05-00OR22725.
This manuscript has been authored by UT-Battelle LLC under contract DE-AC05-00OR22725 with the DOE. The publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
