TY - JOUR
T1 - Assessment of machine learning approaches for predicting the crystallization propensity of active pharmaceutical ingredients
AU - Ghosh, Ayana
AU - Louis, Lydie
AU - Arora, Kapildev K.
AU - Hancock, Bruno C.
AU - Krzyzaniak, Joseph F.
AU - Meenan, Paul
AU - Nakhmanson, Serge
AU - Wood, Geoffrey P.F.
N1 - Publisher Copyright: © The Royal Society of Chemistry.
PY - 2019
Y1 - 2019
N2 - In the current report, three machine learning approaches were assessed for their ability to predict the crystallization propensities of a set of small organic compounds (<709 Da). The algorithms evaluated were random forest regression (RFR), support vector machine regression (SVMR), and neural networks (NN). In addition to these algorithms, the influence of different molecular descriptors, the size of the training sets used, and various experimental factors on the predictive ability of the methods was also considered. For example, factors such as the solvent used, the presence of impurities and/or degradants, the influence of potential seeded crystallizations, and implied supersaturation levels were explicitly investigated. For smaller training set sizes (e.g., ∼50), very little difference in the accuracy of the three algorithms was observed. However, beyond training set sizes of 150, the RFR algorithm typically outperformed the others by up to 20% RMSE. Additionally, as a result of the improved performance with larger training set sizes, the RFR models built with the explicit treatment of solvent typically outperformed models considering only the active pharmaceutical ingredient (API). For example, the best-performing API-only model had an RMSE of 30%, whereas for the API + solvent models the RMSE was found to be 20%. Beyond inclusion of the solvent, it was found that the presence of impurities and/or degradants had the greatest influence on model accuracy. When these experiments were excluded, an additional improvement of up to 10% RMSE was observed in some cases.
UR - http://www.scopus.com/inward/record.url?scp=85061827952&partnerID=8YFLogxK
U2 - 10.1039/C8CE01589A
DO - 10.1039/C8CE01589A
M3 - Article
AN - SCOPUS:85061827952
SN - 1466-8033
VL - 21
SP - 1215
EP - 1223
JO - CrystEngComm
JF - CrystEngComm
IS - 8
ER -