Scalable deep text comprehension for Cancer surveillance on high-performance computing

John X. Qiu, Hong Jun Yoon, Kshitij Srivastava, Thomas P. Watson, J. Blair Christian, Arvind Ramanathan, Xiao C. Wu, Paul A. Fearn, Georgia D. Tourassi

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Background: Deep learning (DL) has advanced the state of the art in bioinformatics applications, driving a trend toward increasingly sophisticated and computationally demanding models trained on ever larger data sets. This vastly increased computational demand challenges the feasibility of conducting cutting-edge research. One solution is to distribute the computational workload across multiple computing cluster nodes with data parallelism algorithms. In this study, we used a high-performance computing environment and implemented the Downpour Stochastic Gradient Descent algorithm for data parallelism to train a Convolutional Neural Network (CNN) for the natural language processing task of information extraction from a massive dataset of cancer pathology reports. We evaluated the scalability improvements from data-parallel training on the Titan supercomputer at the Oak Ridge Leadership Computing Facility. To evaluate scalability, we varied the number of worker nodes and performed a set of experiments comparing the effects of different training batch sizes and optimizer functions.

Results: We found that Adadelta consistently converged at a lower validation loss, though it required over twice as many training epochs as the fastest-converging optimizer, RMSProp. The Adam optimizer consistently achieved a close second-lowest minimum validation loss significantly faster; with batch sizes of 16 and 32, the network converged in only 4.5 training epochs.

Conclusions: We demonstrated that the networked training process is scalable across multiple compute nodes communicating via the Message Passing Interface (MPI) while achieving higher classification accuracy than a traditional machine learning algorithm.
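The data-parallel scheme named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy linear model in place of the CNN, replaces MPI communication with a sequential round-robin loop in a single process, and uses plain SGD rather than Adadelta, RMSProp, or Adam. All names (`ParameterServer`, `downpour_sgd`, `n_fetch`, `n_push`) are hypothetical. It shows only the core Downpour idea: each worker trains a local replica on its own data shard, periodically pushes accumulated gradients to a shared parameter server, and periodically pulls fresh parameters.

```python
import random


class ParameterServer:
    """Holds the shared parameters and applies gradients as they arrive."""

    def __init__(self, n_params, lr):
        self.params = [0.0] * n_params
        self.lr = lr

    def pull(self):
        return list(self.params)

    def push(self, grad):
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]


def batch_gradient(params, batch):
    """Gradient of the mean squared error for the toy model y = w*x + b."""
    w, b = params
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    n = len(batch)
    return [gw / n, gb / n]


def make_shards(n_workers=4, shard_size=64, seed=0):
    """Synthetic, noise-free data from y = 3x + 1, split into one shard per worker."""
    rng = random.Random(seed)
    shards = []
    for _ in range(n_workers):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(shard_size)]
        shards.append([(x, 3.0 * x + 1.0) for x in xs])
    return shards


def downpour_sgd(shards, steps=500, batch_size=16, lr=0.01,
                 n_fetch=5, n_push=5, seed=0):
    """Simulate Downpour SGD; workers take turns instead of running concurrently."""
    rng = random.Random(seed)
    server = ParameterServer(2, lr)
    replicas = [server.pull() for _ in shards]      # one model copy per worker
    pending = [[0.0, 0.0] for _ in shards]          # gradients not yet pushed
    for step in range(1, steps + 1):
        for i, shard in enumerate(shards):
            batch = rng.sample(shard, batch_size)
            grad = batch_gradient(replicas[i], batch)
            # Update the local replica and accumulate the gradient for the server.
            replicas[i] = [p - lr * g for p, g in zip(replicas[i], grad)]
            pending[i] = [a + g for a, g in zip(pending[i], grad)]
            if step % n_push == 0:
                server.push(pending[i])
                pending[i] = [0.0, 0.0]
            if step % n_fetch == 0:
                # The pulled parameters may already be stale with respect to
                # workers that push later in this round.
                replicas[i] = server.pull()
    return server.params
```

Calling `downpour_sgd(make_shards())` returns the server's final `[w, b]`. In a real deployment each worker would run on its own node and exchange parameters and gradients over MPI, so pushes would arrive asynchronously rather than in a fixed order.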

Original language: English
Article number: 488
Journal: BMC Bioinformatics
Volume: 19
DOIs
State: Published - Dec 21 2018

Funding

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of the manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02–06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC5206NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. Funding to pay the Open Access publication charges for this article was provided by the U.S. Department of Energy and the National Cancer Institute of National Institutes of Health.

Funders                                                      Funder number
National Cancer Institute of National Institutes of Health
National Institutes of Health
U.S. Department of Energy
National Cancer Institute
Office of Science
Argonne National Laboratory                                  DE-AC02–06-CH11357
Lawrence Livermore National Laboratory                       DE-AC52-07NA27344
Oak Ridge National Laboratory                                DE-AC05-00OR22725
Los Alamos National Laboratory                               DE-AC5206NA25396
