TY - GEN
T1 - Proteome-scale Deployment of Protein Structure Prediction Workflows on the Summit Supercomputer
AU - Gao, Mu
AU - Coletti, Mark
AU - Davidson, Russell B.
AU - Prout, Ryan
AU - Abraham, Subil
AU - Hernandez, Benjamin
AU - Sedova, Ada
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Deep learning has contributed to major advances in the prediction of protein structure from sequence, a fundamental problem in structural bioinformatics. With predictions now approaching the accuracy of crystallographic experiments, and with accelerators like GPUs and TPUs making inference using large models rapid, genome-level structure prediction becomes an obvious aim. Leadership-class computing resources can be used to perform genome-scale protein structure prediction using state-of-the-art deep learning models, providing a wealth of new data for systems biology applications. Here we describe our efforts to efficiently deploy the AlphaFold v.2 program, for full-proteome structure prediction, at scale on the Oak Ridge Leadership Computing Facility's resources, including the Summit supercomputer. We performed inference to produce the predicted structures for 40,526 protein sequences, corresponding to four prokaryotic proteomes and one plant proteome, using under 4,400 total Summit node hours, equivalent to using the majority of the supercomputer for a little over one hour. We also designed an optimized structure refinement that reduced the time for the relaxation stage of the AlphaFold pipeline by over 10X for longer sequences. We demonstrate the types of analyses that can be performed on proteome-scale collections of sequences, including a search for novel quaternary structures and implications for functional annotation.
KW - deep learning
KW - high-performance computing
KW - protein structure prediction
KW - proteomics
KW - workflow management software
UR - http://www.scopus.com/inward/record.url?scp=85130846619&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW55747.2022.00045
DO - 10.1109/IPDPSW55747.2022.00045
M3 - Conference contribution
AN - SCOPUS:85130846619
T3 - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
SP - 206
EP - 215
BT - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 36th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
Y2 - 30 May 2022 through 3 June 2022
ER -