Optimal vocabulary selection approaches for privacy-preserving deep NLP model training for information extraction and cancer epidemiology

Hong Jun Yoon, Christopher Stanley, J. Blair Christian, Hilda B. Klasky, Andrew E. Blanchard, Eric B. Durbin, Xiao Cheng Wu, Antoinette Stroup, Jennifer Doherty, Stephen M. Schwartz, Charles Wiggins, Mark Damesyn, Linda Coyle, Georgia D. Tourassi

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

Background: With the growing use of artificial intelligence and machine learning techniques in biomedical informatics, security and privacy concerns over the data and subject identities have become an important issue and an essential research topic. Without intentional safeguards, machine learning models may find patterns and features that improve task performance but are associated with private personal information. Objective: The privacy vulnerability of deep learning models for information extraction from medical textual content needs to be quantified, since the models are exposed to private health information and personally identifiable information. The objective of the study is to quantify the privacy vulnerability of deep learning models for natural language processing and to explore a proper way of securing patients' information to mitigate confidentiality breaches. Methods: The target model is the multitask convolutional neural network for information extraction from cancer pathology reports, where the data for training the model are from multiple state population-based cancer registries. This study proposes the following schemes to collect vocabularies from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments. Results: The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.
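
The two vocabulary selection schemes named in the Methods, (a) and (b), can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify tokenization, thresholds, or the exact mutual information estimator, so whitespace tokenization, the `min_registries` parameter, the function names, and the presence/absence formulation of mutual information are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def words_in_multiple_registries(docs_by_registry, min_registries=2):
    """Scheme (a), sketched: keep words seen in at least `min_registries` registries.

    `docs_by_registry` maps a registry name to a list of report strings.
    Whitespace tokenization and the threshold are illustrative assumptions.
    """
    registries_per_word = defaultdict(set)
    for registry, docs in docs_by_registry.items():
        for doc in docs:
            for word in set(doc.lower().split()):
                registries_per_word[word].add(registry)
    return {w for w, regs in registries_per_word.items() if len(regs) >= min_registries}


def mutual_information(docs, labels):
    """Scheme (b), sketched: mutual information (in nats) between a word's
    presence in a report and the report's label, for every observed word."""
    n = len(docs)
    label_counts = Counter(labels)
    joint = defaultdict(Counter)  # word -> label -> number of docs containing word
    for doc, label in zip(docs, labels):
        for word in set(doc.lower().split()):
            joint[word][label] += 1

    scores = {}
    for word, per_label in joint.items():
        n_word = sum(per_label.values())
        mi = 0.0
        for label, n_label in label_counts.items():
            # Two cells per label: word present, word absent.
            cells = (
                (per_label[label], n_word),
                (n_label - per_label[label], n - n_word),
            )
            for n_joint, n_marginal in cells:
                if n_joint > 0:
                    mi += (n_joint / n) * math.log(n_joint * n / (n_marginal * n_label))
        scores[word] = mi
    return scores


# Toy data (fabricated for illustration only, not from the study).
toy_registries = {
    "A": ["carcinoma of breast smith", "benign lung"],
    "B": ["carcinoma of lung jones"],
}
shared_vocab = words_in_multiple_registries(toy_registries)

toy_docs = ["carcinoma breast", "carcinoma lung", "benign breast", "benign lung"]
toy_labels = ["malignant", "malignant", "not_malignant", "not_malignant"]
mi_scores = mutual_information(toy_docs, toy_labels)
```

In the toy data, rare tokens such as surnames appear in a single registry and are dropped by scheme (a), while scheme (b) ranks words by how informative they are about the label; a vocabulary could then be formed by keeping the top-scoring words under some cutoff, which is again an assumption rather than a detail stated in the abstract.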

Original language: English
Pages (from-to): 185-198
Number of pages: 14
Journal: Cancer Biomarkers
Volume: 33
Issue number: 2
State: Published - 2022

Funding

The Utah Cancer Registry is funded by the National Cancer Institute’s SEER Program, Contract No. HHSN261201800016I, and the US Centers for Disease Control and Prevention’s National Program of Cancer Registries, Cooperative Agreement No. NU58DP0063 200, with additional support from the University of Utah and Huntsman Cancer Foundation. This work was performed under the auspices of DOE by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC5206NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. This work also was supported by the Laboratory Directed Research and Development (LDRD) program of Oak Ridge National Laboratory, under LDRD project 9831. The collection of cancer incidence data used in this study was supported by the California Department of Public Health pursuant to California Health and Safety Code Section 103885; Centers for Disease Control and Prevention’s (CDC) National Program of Cancer Registries, under cooperative agreement 5NU58DP006344; the National Cancer Institute’s Surveillance, Epidemiology and End Results Program under contract HHSN261201800032I awarded to the University of California, San Francisco; contract HHSN261201800015I awarded to the University of Southern California; and contract HHSN261201800009I awarded to the Public Health Institute. The ideas and opinions expressed herein are those of the author(s) and do not necessarily reflect the opinions of the State of California, Department of Public Health, the National Cancer Institute, and the Centers for Disease Control and Prevention or their contractors and subcontractors. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy (DOE) Office of Science and the National Nuclear Security Administration. 
This work was also supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by DOE and the NCI of the National Institutes of Health. New Mexico Tumor Registry’s participation in this project was supported by Contract HHSN261201800014I, Task Order HHSN26100001 from the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) Program. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the DOE Office of Science under Contract No. DE-AC05-00OR22725. New Jersey State Cancer Registry data were collected using funding from NCI and the Surveillance, Epidemiology and End Results (SEER) Program (HHSN261201300021I), the CDC’s National Program of Cancer Registries (NPCR) (NU58DP006279-02-00) as well as the State of New Jersey and the Rutgers Cancer Institute of New Jersey. Louisiana Tumor Registry data were collected using funding from NCI and the Surveillance, Epidemiology and End Results (SEER) Program (HHSN261201800007I), the CDC’s National Program of Cancer Registries (NPCR) (NU58DP006332-02-00) as well as the State of Louisiana. This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Kentucky Cancer Registry data were collected with funding from NCI Surveillance, Epidemiology and End Results (SEER) Program (HHSN261201800013I), the CDC National Program of Cancer Registries (NPCR) (U58DP00003907) and the Commonwealth of Kentucky. The Cancer Surveillance System is supported by the National Cancer Institute’s SEER Program (Contract Award HHSN261291800004I) and with additional funds provided by the Fred Hutchinson Cancer Research Center.

Funders: Funder number
CDC National Program of Cancer Registries
NCI Surveillance, Epidemiology and End Results
NPCR: U58DP00003907
National Cancer Institute’s Surveillance, Epidemiology and End Results
National Cancer Institute’s Surveillance, Epidemiology and End Results Program
SEER: NU58DP006332-02-00, HHSN261201800007I, HHSN261201800013I
State of New Jersey
Surveillance, Epidemiology and End Results
University of Utah and Huntsman Cancer Foundation
National Institutes of Health
U.S. Department of Energy
Centers for Disease Control and Prevention: 5NU58DP006344
National Cancer Institute: P30CA177558, NU58DP0063 200, HHSN261291800004I, HHSN261201300021I, NU58DP006279-02-00
University of Southern California
Office of Science
National Nuclear Security Administration
Argonne National Laboratory: DE-AC02-06-CH11357
Lawrence Livermore National Laboratory: DE-AC52-07NA27344
Oak Ridge National Laboratory: DE-AC05-00OR22725
Laboratory Directed Research and Development: 9831
Rutgers Cancer Institute of New Jersey
Los Alamos National Laboratory: DE-AC5206NA25396

Keywords

• Privacy
• artificial intelligence
• cancer epidemiology
• deep learning
• natural language processing
• privacy-preserving training