Abstract
Computational biology is one of many scientific disciplines ripe for innovation and acceleration with the advent of high-performance computing (HPC). In recent years, the field of machine learning has also seen significant benefits from adopting HPC practices. In this work, we present a novel HPC pipeline that incorporates various machine-learning approaches for structure-based functional annotation of proteins on the scale of whole genomes. Our pipeline makes extensive use of deep learning and provides computational insights into best practices for training advanced deep-learning models for high-throughput data such as proteomics data. We showcase methodologies our pipeline currently supports and detail future tasks for our pipeline to envelop, including large-scale sequence comparison using SAdLSA and prediction of protein tertiary structures using AlphaFold2.
Original language | English |
---|---|
Title of host publication | Proceedings of MLHPC 2021 |
Subtitle of host publication | Workshop on Machine Learning in High Performance Computing Environments, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 46-57 |
Number of pages | 12 |
ISBN (Electronic) | 9781665411240 |
DOIs | |
State | Published - 2021 |
Event | 7th IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, MLHPC 2021 - St. Louis, United States; Duration: Nov 15 2021 → … |
Publication series
Name | Proceedings of MLHPC 2021: Workshop on Machine Learning in High Performance Computing Environments, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis |
---|
Conference
Conference | 7th IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, MLHPC 2021 |
---|---|
Country/Territory | United States |
City | St. Louis |
Period | 11/15/21 → … |
Funding
This research was partly sponsored by the Office of Biological and Environmental Research's Genomic Science program within the US Department of Energy Office of Science, under award number ERKP917, the Laboratory Directed Research and Development Program at Oak Ridge National Laboratory (ORNL), and used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725, granted in part by the Advanced Scientific Computing Research (ASCR) Leadership Computing Challenge (ALCC) program. The development of the deep learning tools for protein domain prediction, protein model quality assessment, protein interaction prediction, and cryo-EM data analysis was supported by the National Science Foundation (DBI1759934 and IIS1763246), National Institutes of Health (R01GM093123), Department of Energy, USA (DEAR0001213, DE-SC0020400 and DE-SC0021303), and the Thompson Missouri Distinguished Professorship. The development of SAdLSA was supported in part by the National Institutes of Health (NIH R35GM118039) and used resources supported by the Partnership for an Advanced Computing Environment (PACE) at Georgia Tech. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- computational biology
- deep learning
- high-performance computing
- machine learning
- protein sequence alignment
- protein structure prediction