Abstract
Transformer-based large language models have remarkable potential to accelerate design optimization for applications such as drug development and material discovery. Self-supervised pretraining of transformer models requires large-scale data sets, which are often sparsely populated in topical areas such as polymer science. State-of-the-art approaches for polymers conduct data augmentation to generate additional samples but unavoidably incur extra computational costs. In contrast, large-scale open-source data sets are available for small molecules and provide a potential solution to data scarcity through transfer learning. In this work, we show that using transformers pretrained on small molecules and fine-tuned on polymer properties achieves comparable accuracy to those trained on augmented polymer data sets for a series of benchmark prediction tasks.
Original language | English |
---|---|
Pages (from-to) | 7689-7698 |
Number of pages | 10 |
Journal | Journal of Chemical Information and Modeling |
Volume | 63 |
Issue number | 24 |
DOIs | |
State | Published - Dec 25 2023 |
Funding
The research described here was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources from the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under contract DE-AC05-00OR22725. L.K. and A.K.N. acknowledge support from the US Department of Energy, Office of Science, Basic Energy Sciences, under contract ERKCK60. This manuscript has been authored by UT-Battelle LLC under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://www.energy.gov/doe-public-access-plan).