Abstract
Objectives: No existing algorithm can reliably identify metastasis from pathology reports across multiple cancer types and the entire US population. In this study, we develop a deep learning model that automatically detects patients with metastatic cancer by using pathology reports from many laboratories and of multiple cancer types. Materials and Methods: We use 60 471 unstructured pathology reports from 4 Surveillance, Epidemiology, and End Results (SEER) registries. The reports were coded into 1 of 3 labels: metastasis negative, metastasis positive, or metastasis undetermined. We utilize a task-specific deep neural network trained from scratch and compare its performance with a widely used large language model (LLM). Results: Our deep learning architecture trained on task-specific data outperforms a general-purpose LLM, with a recall of 0.894 compared to 0.824. We quantified model uncertainty and used it to defer reports for human review. We found that retaining 72.9% of reports, with the remainder deferred, increased recall from 0.894 to 0.969. Discussion: A smaller deep learning architecture trained on task-specific data outperforms a general LLM. Equally critical to model performance is the incorporation of uncertainty quantification, achieved here through an abstention mechanism. Conclusions: This study's findings demonstrate the feasibility of developing algorithms to automatically identify metastatic cancer cases from unstructured pathology reports.
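The trade-off between the share of reports retained and recall can be illustrated with a minimal sketch of abstention. The abstract does not specify how uncertainty is quantified, so the sketch assumes the simplest possible rule: a report is deferred when the highest predicted class probability falls below a threshold. The function names, the label order, the threshold values, and the toy probabilities are all hypothetical and are not taken from the study.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed label order for illustration only: 0 = negative, 1 = positive, 2 = undetermined.
POSITIVE = 1


def predict_or_abstain(probs: Sequence[float], threshold: float) -> Optional[int]:
    """Return the most probable label index, or None to defer the report to a human."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None


def coverage_and_recall(
    prob_rows: List[Sequence[float]],
    gold: List[int],
    threshold: float,
    positive: int = POSITIVE,
) -> Tuple[float, float]:
    """Compute the fraction of reports retained and the recall of the
    positive class measured on the retained reports only."""
    decisions = [(predict_or_abstain(p, threshold), y) for p, y in zip(prob_rows, gold)]
    kept = [(pred, y) for pred, y in decisions if pred is not None]
    coverage = len(kept) / len(gold)
    tp = sum(1 for pred, y in kept if y == positive and pred == positive)
    fn = sum(1 for pred, y in kept if y == positive and pred != positive)
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return coverage, recall


# Toy class probabilities (made up, not data from the study).
toy_probs = [
    [0.05, 0.90, 0.05],  # confident, correctly positive
    [0.50, 0.40, 0.10],  # low confidence, positive report misread as negative
    [0.85, 0.10, 0.05],  # confident, correctly negative
    [0.10, 0.15, 0.75],  # confident, correctly undetermined
    [0.20, 0.70, 0.10],  # confident, correctly positive
]
toy_gold = [1, 1, 0, 2, 1]

no_abstention = coverage_and_recall(toy_probs, toy_gold, threshold=0.0)
with_abstention = coverage_and_recall(toy_probs, toy_gold, threshold=0.6)
print(no_abstention, with_abstention)
```

On this toy input, raising the threshold from 0.0 to 0.6 defers the single low-confidence report, so coverage drops from 1.0 to 0.8 while recall on the retained reports rises from 2/3 to 1.0. The direction of the effect matches what the abstract reports, but the numbers here are purely illustrative.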
| Original language | English |
|---|---|
| Article number | ooaf070 |
| Journal | JAMIA Open |
| Volume | 8 |
| Issue number | 4 |
| State | Published - Aug 1 2025 |
Funding
Office of Science of the US Department of Energy: This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This work has been supported in part by the US Department of Energy (DOE) and the NCI of the National Institutes of Health. This work was performed under the auspices of the DOE by Oak Ridge National Laboratory under Contract DE-AC05-00OR22725.
Keywords
- machine learning
- metastasis
- natural language processing
- recurrence