Abstract
De novo peptide design is a new frontier with broad application potential in the biological and biomedical fields. Most existing models for de novo peptide design rely largely on sequence homology, which restricts them to evolutionarily derived protein sequences and omits the physicochemical context essential to protein folding. Generative machine learning is a promising way to synthesize theoretical data that are based on, but distinct from, the observable universe. In this study, we created and tested a custom peptide generative adversarial network intended to design peptide sequences that can fold into the β-hairpin secondary structure. This deep neural network model is designed to establish a preliminary foundation for a generative approach based on physicochemical and conformational properties of the 20 canonical amino acids (for example, hydrophobicity and residue volume), using extant structure-specific sequence data from the PDB. The beta generative adversarial network model robustly distinguishes β-hairpin secondary structure from α-helix and intrinsically disordered peptides with an accuracy of up to 96%, and it generates artificial β-hairpin peptide sequences with minimum sequence identities of around 31% and 50% when compared against the current NCBI PDB and nonredundant databases, respectively. These results highlight the potential of generative models specifically anchored by physicochemical and conformational property features of amino acids to expand the sequence-to-structure landscape of proteins beyond evolutionary limits.
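To illustrate what a physicochemical encoding of a peptide might look like, the following minimal sketch maps each residue to a scaled (hydrophobicity, volume) pair. It is not the authors' implementation: the abstract does not state which scales, normalization, or additional features the model uses. The Kyte-Doolittle hydropathy index, the Zamyatnin residue volumes, the min-max scaling, and the function name `encode_peptide` are all assumptions chosen for illustration.

```python
# Hypothetical featurization sketch; the scales below are assumed, not taken from the paper.

# Kyte-Doolittle hydropathy index for the 20 canonical amino acids.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Approximate residue volumes in cubic angstroms (Zamyatnin, 1972).
RESIDUE_VOLUME = {
    "A": 88.6, "R": 173.4, "N": 114.1, "D": 111.1, "C": 108.5,
    "Q": 143.8, "E": 138.4, "G": 60.1, "H": 153.2, "I": 166.7,
    "L": 166.7, "K": 168.6, "M": 162.9, "F": 189.9, "P": 112.7,
    "S": 89.0, "T": 116.1, "W": 227.8, "Y": 193.6, "V": 140.0,
}


def _min_max(table):
    """Rescale a property table to the range [0, 1]."""
    lo, hi = min(table.values()), max(table.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in table.items()}


HYDRO_SCALED = _min_max(KD_HYDROPATHY)
VOLUME_SCALED = _min_max(RESIDUE_VOLUME)


def encode_peptide(sequence):
    """Return one (hydrophobicity, volume) pair per residue, each in [0, 1]."""
    features = []
    for aa in sequence.upper():
        if aa not in HYDRO_SCALED:
            raise ValueError(f"non-canonical residue: {aa!r}")
        features.append((HYDRO_SCALED[aa], VOLUME_SCALED[aa]))
    return features


if __name__ == "__main__":
    # Arbitrary illustrative sequence, not taken from the study's data.
    for pair in encode_peptide("GEWTYDDATKTFTVTE"):
        print(pair)
```

A per-residue matrix of this kind, rather than a one-hot sequence, is the sort of input that lets a model compare residues by property instead of by identity alone.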
Original language | English |
---|---|
Journal | Biophysical Journal |
DOIs | |
State | Accepted/In press - 2024 |
Funding
We would like to acknowledge the Georgia Tech (GT) Southeast Center for Math and Biology (SCMB) for ongoing scientific feedback on this project. Thanks to Gary Newman, Bettina Bommarius, Nicholas Hud, and the GT Institute for Bioengineering and Bioscience (IBB) core facilities for support. This work was funded by NSF-Simons grant 1764406 (to J.M. and M.T.) and by NIH R01-GM148586 (to J.C.G.). G.D., V.S., and J.M. acknowledge the CADES and Summit computational resources provided through the Oak Ridge Leadership Computing Facility. A.C.M. acknowledges start-up funds from the Georgia Institute of Technology. MD simulations were run using resources provided through the Extreme Science and Engineering Discovery Environment (XSEDE, TG-MCB130173), which is supported by NSF grant ACI-1548562, as well as the Hive cluster, which is supported by NSF grant 1828187 and is managed by the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology. The authors declare no competing interests.