Data Generation for Machine Learning Interatomic Potentials and Beyond

Maksim Kulichenko, Benjamin Nebgen, Nicholas Lubbers, Justin S. Smith, Kipton Barros, Alice E.A. Allen, Adela Habib, Emily Shinkle, Nikita Fedik, Ying Wai Li, Richard A. Messerly, Sergei Tretiak

Research output: Contribution to journal › Review article › peer-review

Abstract

The field of data-driven chemistry is undergoing an evolution, driven by innovations in machine learning models for predicting molecular properties and behavior. Recent strides in machine learning interatomic potentials (MLIPs) have paved the way for accurate modeling of diverse chemical and structural properties at the atomic level. The key determinant defining MLIP reliability remains the quality of the training data. A paramount challenge lies in constructing training sets that capture specific domains in the vast chemical and structural space. This Review navigates the intricate landscape of essential components and integrity of training data that ensure the extensibility and transferability of the resulting models. We delve into the details of active learning, discussing its various facets and implementations. We outline different types of uncertainty quantification applied to atomistic data acquisition and the correlations between estimated uncertainty and true error. The role of atomistic data samplers in generating diverse and informative structures is highlighted. Furthermore, we discuss data acquisition via modified and surrogate potential energy surfaces as an innovative approach to diversify training data. The Review also provides a list of publicly available data sets that cover essential domains of chemical space.

Original language: English
Journal: Chemical Reviews
State: Accepted/In press - 2024
Externally published: Yes

Funding

The work at Los Alamos National Laboratory (LANL) was supported by the LANL Directed Research and Development Funds (LDRD) and the U.S. Department of Energy Office of Basic Energy Sciences (FWP: LANLE8B3), and was performed in part at the Center for Nonlinear Studies (CNLS) and the Center for Integrated Nanotechnologies (CINT), a US Department of Energy (DOE) Office of Science user facility at LANL. M. K. and N. F. acknowledge the financial support from the Director’s Postdoctoral Fellowship at LANL funded by LDRD. N. F. and A. E. A. A. acknowledge support from the Center for Nonlinear Studies (CNLS). K. B., R. A. M., B. N., A. E. A. A., and S. T. acknowledge support from the US DOE, Office of Science, Basic Energy Sciences, Chemical Sciences, Geosciences, and Biosciences Division under Triad National Security, LLC (“Triad”) Contract No. 89233218CNA000001 (FWP: LANLE3F2). This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001. We also acknowledge the CCS-7 Darwin cluster at LANL for additional computing resources. LANL is managed by Triad National Security, LLC, for the US DOE’s NNSA, under Contract No. 89233218CNA000001.

Funders                                                      Funder number
National Nuclear Security Administration
Basic Energy Sciences
Center for Integrated Nanotechnologies
Los Alamos National Laboratory
LANL Directed Research and Development Funds
U.S. Department of Energy
Center for Nonlinear Studies
Office of Science
U.S. Department of Energy Office of Basic Energy Sciences    LANLE8B3
Chemical Sciences, Geosciences, and Biosciences Division     89233218CNA000001, LANLE3F2
