Abstract
Background: We examine the problem of clustering biomolecular simulations using deep learning techniques. Since biomolecular simulation datasets are inherently high dimensional, it is often necessary to build low dimensional representations that can be used to extract quantitative insights into the atomistic mechanisms that underlie complex biological processes.

Results: We use a convolutional variational autoencoder (CVAE) to learn low dimensional, biophysically relevant latent features from long time-scale protein folding simulations in an unsupervised manner. We demonstrate our approach on three model protein folding systems, namely Fs-peptide (14 μs aggregate sampling), villin headpiece (single trajectory of 125 μs) and β-β-α (BBA) protein (223 + 102 μs sampling across two independent trajectories). In these systems, we show that the latent features learned by the CVAE correspond to distinct conformational substates along the protein folding pathways. The CVAE model predicts, on average, nearly 89% of all contacts within the folding trajectories correctly, while being able to extract folded, unfolded and potentially misfolded states in an unsupervised manner. Further, the CVAE model can be used to learn latent features of protein folding that can be applied to other independent trajectories, making it particularly attractive for identifying intrinsic features that correspond to conformational substates that share similar structural features.

Conclusions: Together, we show that the CVAE model can quantitatively describe complex biophysical processes such as protein folding.
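As an illustration of the contact-based accuracy figure quoted in the abstract, the following is a minimal sketch of how a binary contact map could be built from residue coordinates and compared against a reconstructed map. It is not the authors' implementation. The function names `contact_map` and `contact_accuracy`, the 8 Å distance cutoff, the 0.5 decision threshold, and the reading of "contacts predicted correctly" as the fraction of matching matrix entries are all assumptions made for this example, and the coordinates and "reconstruction" are synthetic.

```python
import numpy as np


def contact_map(coords, cutoff=8.0):
    """Binary contact map from an (n_residues, 3) coordinate array.

    The 8.0 cutoff (in the units of `coords`) is an assumed value.
    """
    # Pairwise difference vectors via broadcasting: shape (n, n, 3).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < cutoff).astype(np.uint8)


def contact_accuracy(true_map, reconstructed, threshold=0.5):
    """Fraction of matrix entries where the thresholded reconstruction
    agrees with the true binary contact map (an assumed metric)."""
    predicted = (reconstructed >= threshold).astype(np.uint8)
    return float(np.mean(predicted == true_map))


# Synthetic stand-in data: random coordinates for a 28-residue chain.
rng = np.random.default_rng(0)
coords = rng.normal(scale=10.0, size=(28, 3))
cmap = contact_map(coords)

# Hypothetical decoder output: the true map perturbed by Gaussian noise.
noisy = np.clip(cmap + rng.normal(scale=0.3, size=cmap.shape), 0.0, 1.0)
print(contact_accuracy(cmap, noisy))
```

In an actual workflow, `noisy` would be replaced by the decoder output of a trained model for each simulation frame, and the accuracy would be averaged over frames.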
Original language | English |
---|---|
Article number | 484 |
Journal | BMC Bioinformatics |
Volume | 19 |
DOIs | |
State | Published - Dec 21 2018 |
Funding
The authors would like to thank D. E. Shaw Research for providing access to the protein folding simulation trajectories of BBA and VHP. The authors also thank the MSMBuilder team for making their Fs-Peptide simulations available online. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of the manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC5206NA25396, Oak Ridge National Laboratory under Contract DE-AC05-00OR22725, and Frederick National Laboratory for Cancer Research under Contract HHSN261200800001E. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Publication costs were funded in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health and the Laboratory Director’s Research and Development Fund.
Funders | Funder number |
---|---|
National Institutes of Health | |
U.S. Department of Energy | |
National Cancer Institute | |
Office of Science | |
Argonne National Laboratory | DE-AC02-06-CH11357 |
Lawrence Livermore National Laboratory | DE-AC52-07NA27344 |
Oak Ridge National Laboratory | DE-AC05-00OR22725 |
Los Alamos National Laboratory | DE-AC5206NA25396 |
Frederick National Laboratory for Cancer Research | HHSN261200800001E |
Keywords
- Conformational substates
- Deep learning
- Protein folding
- Variational autoencoder