Privacy-Preserving Deep Learning NLP Models for Cancer Registries

Mohammed Alawad, Hong Jun Yoon, Shang Gao, Brent Mumphrey, Xiao Cheng Wu, Eric B. Durbin, Jong Cheol Jeong, Isaac Hands, David Rust, Linda Coyle, Lynne Penberthy, Georgia Tourassi

Research output: Contribution to journal › Article › peer-review

25 Scopus citations

Abstract

Population cancer registries can benefit from Deep Learning (DL) to automatically extract cancer characteristics from the high volume of unstructured pathology text reports they process annually. The success of DL in tackling this and other real-world problems depends on the availability of large labeled datasets for model training. Although collaboration among cancer registries is essential to fully exploit the promise of DL, privacy and confidentiality concerns are the main obstacles to data sharing across cancer registries. Moreover, DL for natural language processing (NLP) requires sharing a vocabulary dictionary for the embedding layer, which may contain patient identifiers. Thus, even distributing the trained models across cancer registries can violate privacy. In this article, we propose DL NLP model distribution via privacy-preserving transfer learning approaches that do not share sensitive data. These approaches are used to distribute a multitask convolutional neural network (MT-CNN) NLP model among cancer registries. The model is trained to extract six key cancer characteristics (tumor site, subsite, laterality, behavior, histology, and grade) from cancer pathology reports. Using 410,064 pathology documents from two cancer registries, we compare our proposed approach to conventional transfer learning without privacy preservation, single-registry models, and a model trained on centrally hosted data. The results show that transfer learning approaches, including data sharing and model distribution, significantly outperform the single-registry model. In addition, the best-performing privacy-preserving model distribution approach achieves average micro- and macro-F1 scores across all extraction tasks (0.823, 0.580) that are statistically indistinguishable from those of the centralized model (0.827, 0.585).
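The abstract reports both micro- and macro-averaged F1, and the two differ widely (roughly 0.82 versus 0.58). The sketch below is only an illustration of how the two averages are commonly computed for a single-label classification task; it is not the authors' evaluation code. The function name `f1_scores` and the toy labels are hypothetical and are not taken from the article's data.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    # Classes seen in either the gold labels or the predictions.
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy labels: one frequent class and two rare ones
# that the classifier never predicts.
y_true = ["C50"] * 8 + ["C34", "C18"]
y_pred = ["C50"] * 10
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the frequent class dominates the micro average, while the two missed rare classes each contribute a zero to the macro average. That is one common reason a macro score sits well below the micro score when label distributions are imbalanced; whether it explains the gap reported here is not stated in the abstract.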

Original language: English
Pages (from-to): 1219-1230
Number of pages: 12
Journal: IEEE Transactions on Emerging Topics in Computing
Volume: 9
Issue number: 3
DOIs
State: Published - 2021

Funding

This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DEAC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC5206NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. This work has also been supported by the National Cancer Institute under Contract No. HHSN261201800013I/HHSN26100001 and NCI Cancer Center Support Grant (P30CA177558). This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders (funder number, where listed)
National Institutes of Health
U.S. Department of Energy
National Cancer Institute
Argonne National Laboratory: DE-AC02-06-CH11357
Lawrence Livermore National Laboratory: DEAC52-07NA27344
Oak Ridge National Laboratory: DE-AC05-00OR22725, P30CA177558, HHSN261201800013I/HHSN26100001
Los Alamos National Laboratory: DE-AC5206NA25396

Keywords

• NLP
• Privacy-preserving
• cancer pathology reports
• information extraction
• multi-task CNN
• transfer learning
