Abstract
The field of data-driven chemistry is undergoing an evolution, driven by innovations in machine learning models for predicting molecular properties and behavior. Recent strides in machine learning interatomic potentials (MLIPs) have paved the way for accurate modeling of diverse chemical and structural properties at the atomic level. The key determinant defining MLIP reliability remains the quality of the training data. A paramount challenge lies in constructing training sets that capture specific domains in the vast chemical and structural space. This Review navigates the intricate landscape of essential components and integrity of training data that ensure the extensibility and transferability of the resulting models. We delve into the details of active learning, discussing its various facets and implementations. We outline different types of uncertainty quantification applied to atomistic data acquisition and the correlations between estimated uncertainty and true error. The role of atomistic data samplers in generating diverse and informative structures is highlighted. Furthermore, we discuss data acquisition via modified and surrogate potential energy surfaces as an innovative approach to diversify training data. The Review also provides a list of publicly available data sets that cover essential domains of chemical space.
Original language | English |
---|---|
Journal | Chemical Reviews |
DOIs | |
State | Accepted/In press - 2024 |
Externally published | Yes |
Funding
The work at Los Alamos National Laboratory (LANL) was supported by the LANL Directed Research and Development Funds (LDRD), the U.S. Department of Energy Office of Basic Energy Sciences (FWP: LANLE8B3) and performed in part at the Center for Nonlinear Studies (CNLS) and the Center for Integrated Nanotechnologies (CINT), a US Department of Energy (DOE) Office of Science user facility at LANL. M. K. and N. F. acknowledge the financial support from the Director’s Postdoctoral Fellowship at LANL funded by LDRD. N. F. and A. E. A. A. acknowledge support from the Center for Nonlinear Studies (CNLS). K. B., R. A. M., B. N., A. E. A. A., and S. T. acknowledge support from the US DOE, Office of Science, Basic Energy Sciences, Chemical Sciences, Geosciences, and Biosciences Division under Triad National Security, LLC (“Triad”) Contract No. 89233218CNA000001 (FWP: LANLE3F2). This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001. We also acknowledge the CCS-7 Darwin cluster at LANL for additional computing resources. LANL is managed by Triad National Security, LLC, for the US DOE’s NNSA, under Contract No. 89233218CNA000001.
Funders | Funder number |
---|---|
National Nuclear Security Administration | |
Basic Energy Sciences | |
Center for Integrated Nanotechnologies | |
Los Alamos National Laboratory | |
LANL Directed Research and Development Funds | |
U.S. Department of Energy | |
Center for Nonlinear Studies | |
Office of Science | |
U.S. Department of Energy Office of Basic Energy Sciences | LANLE8B3 |
Chemical Sciences, Geosciences, and Biosciences Division | 89233218CNA000001, LANLE3F2 |