Abstract
Deep learning has contributed to major advances in the prediction of protein structure from sequence, a fundamental problem in structural bioinformatics. With predictions now approaching the accuracy of crystallographic experiments, and with accelerators like GPUs and TPUs making inference using large models rapid, genome-level structure prediction becomes an obvious aim. Leadership-class computing resources can be used to perform genome-scale protein structure prediction using state-of-the-art deep learning models, providing a wealth of new data for systems biology applications. Here we describe our efforts to efficiently deploy AlphaFold v.2 for full-proteome structure prediction at scale on the Oak Ridge Leadership Computing Facility's resources, including the Summit supercomputer. We performed inference to produce the predicted structures for 40,526 protein sequences, corresponding to four prokaryotic proteomes and one plant proteome, using under 4,400 total Summit node hours, equivalent to using the majority of the supercomputer for a little over one hour. We also designed an optimized structure refinement that reduced the time for the relaxation stage of the AlphaFold pipeline by over 10X for longer sequences. We demonstrate the types of analyses that can be performed on proteome-scale collections of sequences, including a search for novel quaternary structures and implications for functional annotation.
Original language | English |
---|---|
Title of host publication | Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 206-215 |
Number of pages | 10 |
ISBN (Electronic) | 9781665497473 |
State | Published - 2022 |
Event | 36th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022 - Virtual, Online, France. Duration: May 30 2022 → Jun 3 2022 |
Conference
Conference | 36th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022 |
---|---|
Country/Territory | France |
City | Virtual, Online |
Period | 05/30/22 → 06/03/22 |
Funding
This research was sponsored in part by the Office of Biological and Environmental Research’s Genomic Science program within the US Department of Energy Office of Science, under award number ERKP917, and by the Laboratory Directed Research and Development Program at Oak Ridge National Laboratory (ORNL). It used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725, granted in part by the Advanced Scientific Computing Research (ASCR) Leadership Computing Challenge (ALCC) program, as well as resources supported by the Partnership for an Advanced Computing Environment (PACE) at Georgia Tech. We thank Bryan Piatkowski, Jerry Parks and Justin North for genome information. Notice: This manuscript has been authored in part by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- deep learning
- high-performance computing
- protein structure prediction
- proteomics
- workflow management software