High-Performance Deep Learning Toolbox for Genome-Scale Prediction of Protein Structure and Function

Mu Gao, Peik Lund-Andersen, Alex Morehead, Sajid Mahmud, Chen Chen, Xiao Chen, Nabin Giri, Raj S. Roy, Farhan Quadir, T. Chad Effler, Ryan Prout, Subil Abraham, Wael Elwasif, N. Quentin Haas, Jeffrey Skolnick, Jianlin Cheng, Ada Sedova

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

9 Scopus citations

Abstract

Computational biology is one of many scientific disciplines ripe for innovation and acceleration with the advent of high-performance computing (HPC). In recent years, the field of machine learning has also seen significant benefits from adopting HPC practices. In this work, we present a novel HPC pipeline that incorporates various machine-learning approaches for structure-based functional annotation of proteins on the scale of whole genomes. Our pipeline makes extensive use of deep learning and provides computational insights into best practices for training advanced deep-learning models for high-throughput data such as proteomics data. We showcase methodologies our pipeline currently supports and detail future tasks for our pipeline to envelop, including large-scale sequence comparison using SAdLSA and prediction of protein tertiary structures using AlphaFold2.

Original language: English
Title of host publication: Proceedings of MLHPC 2021
Subtitle of host publication: Workshop on Machine Learning in High Performance Computing Environments, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 46-57
Number of pages: 12
ISBN (Electronic): 9781665411240
DOIs
State: Published - 2021
Event: 7th IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, MLHPC 2021 - St. Louis, United States
Duration: Nov 15 2021 → …

Publication series

Name: Proceedings of MLHPC 2021: Workshop on Machine Learning in High Performance Computing Environments, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 7th IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, MLHPC 2021
Country/Territory: United States
City: St. Louis
Period: 11/15/21 → …

Funding

This research was partly sponsored by the Office of Biological and Environmental Research's Genomic Science program within the US Department of Energy Office of Science, under award number ERKP917, the Laboratory Directed Research and Development Program at Oak Ridge National Laboratory (ORNL), and used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725, granted in part by the Advanced Scientific Computing Research (ASCR) Leadership Computing Challenge (ALCC) program. The development of the deep learning tools for protein domain prediction, protein model quality assessment, protein interaction prediction, and cryo-EM data analysis was supported by the National Science Foundation (DBI1759934 and IIS1763246), National Institutes of Health (R01GM093123), Department of Energy, USA (DEAR0001213, DE-SC0020400 and DE-SC0021303), and the Thompson Missouri Distinguished Professorship. The development of SAdLSA was supported in part by the National Institutes of Health (NIH R35GM118039) and used resources supported by the Partnership for an Advanced Computing Environment (PACE) at Georgia Tech. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Keywords

  • computational biology
  • deep learning
  • high-performance computing
  • machine learning
  • protein sequence alignment
  • protein structure prediction
