ORNL_AISD-Ex: Quantum chemical prediction of UV/Vis absorption spectra for over 10 million organic molecules

Dataset

Description

We calculated electronic excitation energies and the associated oscillator strengths using the time-dependent density-functional tight-binding (TD-DFTB) method [1]. The SMILES (simplified molecular-input line-entry system) strings of the molecules in the AISD HOMO-LUMO database [2] were converted to 3D atomistic structures, pre-optimized with the Merck Molecular Force Field (MMFF94) in RDKit [3,4], and stored in PDB files. Each PDB file primarily stores the Cartesian coordinates of every atom of the molecule, along with summary information about the structure, sequence, and experiment. We then optimized the molecular geometries in the electronic ground state using the density-functional tight-binding (DFTB) method [5], followed by single-point excited-state calculations, as described below. Because RDKit employs a random choice when generating molecular conformers, the molecular geometries in this dataset may differ from those generated for the AISD HOMO-LUMO dataset.
The computed excitation energies and oscillator strengths can be converted into predicted UV/Vis absorption spectra: the excitation energies correspond to absorption peak positions, and the oscillator strengths are a good measure of the probability of absorbing visible or UV light in transitions between the electronic ground state and the excited states.
The conversion of SMILES strings to 3D Cartesian coordinates of fully DFTB-optimized molecules was successful for 10,502,904 out of 10,502,917 molecules; for these molecules, both the geometry optimization and the excited-state calculation completed successfully. The DFTB calculations did not complete for the remaining 13 molecules of the original AISD HOMO-LUMO dataset, but we still provide geometry information for them.
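
The conversion from excitation energies and oscillator strengths to an absorption spectrum can be illustrated with a short sketch. It is not taken from the dataset's code repository. It assumes a Gaussian line shape with an arbitrarily chosen width (`sigma_ev`), the function names `broadened_spectrum` and `ev_to_nm` are hypothetical, and the three transitions are made-up values rather than entries from the dataset.

```python
import math

HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV*nm


def broadened_spectrum(excitations, grid_ev, sigma_ev=0.2):
    """Sum one normalized Gaussian per transition, weighted by oscillator strength.

    excitations: iterable of (excitation energy in eV, oscillator strength)
    grid_ev:     energies (eV) at which the spectrum is evaluated
    sigma_ev:    Gaussian standard deviation in eV (an assumed broadening)
    """
    norm = 1.0 / (sigma_ev * math.sqrt(2.0 * math.pi))
    spectrum = []
    for e in grid_ev:
        total = 0.0
        for energy_ev, strength in excitations:
            z = (e - energy_ev) / sigma_ev
            total += strength * norm * math.exp(-0.5 * z * z)
        spectrum.append(total)
    return spectrum


def ev_to_nm(energy_ev):
    """Convert a photon energy in eV to a wavelength in nm."""
    return HC_EV_NM / energy_ev


# Made-up transitions for illustration only: (energy in eV, oscillator strength)
excitations = [(3.10, 0.45), (4.20, 0.10), (5.00, 0.80)]

# Energy grid from 2.00 eV to 6.00 eV in steps of 0.01 eV
step_ev = 0.01
grid = [2.0 + step_ev * i for i in range(401)]

intensity = broadened_spectrum(excitations, grid)

# Locate the most intense point of the broadened spectrum
peak_index = max(range(len(grid)), key=intensity.__getitem__)
peak_ev = grid[peak_index]
print(f"strongest band near {peak_ev:.2f} eV ({ev_to_nm(peak_ev):.1f} nm)")
```

Because each Gaussian is normalized, the area under the broadened curve equals the sum of the oscillator strengths, which gives a simple sanity check when a different line shape or width is substituted.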
The molecules are diverse in chemical composition (spanning five non-hydrogen elements: carbon, nitrogen, oxygen, fluorine, and sulfur) and in molecular size (the smallest molecule contains 5 non-hydrogen atoms, and the largest contains 71).
The DFTB method [5] is an approximation to density functional theory (DFT) that uses a minimal basis set in conjunction with a two-center approximation to the electronic Hamiltonian and overlap matrix elements. The DFTB total energy is the sum of an electronic and a repulsive energy contribution, whose calculation requires optimized electronic parameters and diatomic repulsive potential energy functions. All DFTB calculations were performed using the DFTB+ code [6] (version 21.2) through the DFTB+ wrapper in the Atomic Simulation Environment (ASE, version 3.22.1) [7], which internally converted the Cartesian coordinates from PDB to the .gen file format.
For the geometry optimizations on the electronic ground-state potential energy surface, we chose the third-order DFTB (DFTB3) method [5c] with the matching 3ob set of electronic parameters and repulsive potentials [8]. The empirical γ-damping correction for hydrogen bonds and Grimme’s D3 empirical dispersion correction with Becke-Johnson damping (D3(BJ)) [9] were included to improve the description of non-covalent interactions. For the single-point excited-state calculations, we employed the TD-DFTB method in conjunction with the DFTB2 method [5b] and the matching mio [5b,10] and halorg [11] parameter sets. To cover a sufficient number of excited states, we requested the simultaneous calculation of 50 singlet excited states, based on linear-response theory using the Casida equation [1] and the ARPACK diagonalizer [Lehoucq, R. B.; Sorensen, D. C.; Yang, C. ARPACK Users Guide: Solution of large-scale eigenvalue problems by implicitly restarted Arnoldi methods, 1997].
The dataset contains 1001 tar.gz files. The tar files with successfully processed molecules are named “ornl_aisd_ex_1.tar.gz” through “ornl_aisd_ex_1000.tar.gz”; the 13 failed molecules are in “ornl_aisd_ex_unprocessed.tar.gz”. Each numbered tar file contains 10,500 molecules, with the following exceptions: tar files 34, 121, 128, 352, 360, 429, 495, 509, 518, 627, 668, 676, and 862 contain 10,499 molecules each, and the last tar file, number 1000, contains 13,417 molecules. Together, the 1000 numbered tar files hold the 10,502,904 successfully processed molecules (986 × 10,500 + 13 × 10,499 + 13,417). The total size of the uncompressed dataset is over 283 gigabytes.
The code for calculating the electronic excitation energies and for the statistical analysis of the dataset is provided in the following GitHub repository: https://github.com/ORNL/Analysis-of-Large-Scale-Molecular-Datasets-with-Python
Calculating the UV/Vis spectrum of a molecule requires three main operations:
1. Converting the SMILES string representation of the molecule into a geometric structure in which each atom is assigned XYZ coordinates. The geometric structure is written to the file smiles.pdb.
2. Using smiles.pdb to compute the relaxed geometry of the molecule, which corresponds to the equilibrium positions of the atoms in the electronic ground state. This step generates the files band.out, detailed.out, and geo_end.gen.
3. Using geo_end.gen to calculate the UV/Vis spectrum of the molecule, which is written to the file EXC.DAT.
Every molecule in the dataset has its own directory. Each molecule directory contains the following files:
1. geo_end.gen
2. detailed.out
3. band.out
4. EXC.DAT
5. smiles.pdb

REFERENCES

[1] Niehaus, T. A.; Suhai, S.; Della Sala, F.; Lugli, P.; Elstner, M.; Seifert, G.; Frauenheim, Th. Tight-binding approach to time-dependent density-functional response theory. Phys. Rev. B 2001, 63, 085108/1-9.
[2] Blanchard, A.; Gounley, J.; Metha, K.; Yoo, P.; Irle, S. AISD HOMO-LUMO. DOI: 10.13139/ORNLNCCS/1869409
[3] RDKit: Cheminformatics and Machine Learning Software. 2013. http://www.rdkit.org
[4] Tosco, P.; Stiefl, N.; Landrum, G. Bringing the MMFF force field to the RDKit: implementation and validation. J. Cheminform. 2014, 6, 1-4.
[5] a) Porezag, D.; Frauenheim, T.; Köhler, T.; Seifert, G.; Kaschner, R. Construction of tight-binding-like potentials on the basis of density-functional theory: Application to carbon. Phys. Rev. B 1995, 51, 12947-12957; b) Elstner, M.; Porezag, D.; Jungnickel, G.; Elsner, J.; Haugk, M.; Frauenheim, Th.; Suhai, S.; Seifert, G. Phys. Rev. B 1998, 58, 7260-7268; c) Gaus, M.; Cui, Q.; Elstner, M. DFTB3: Extension of the Self-Consistent-Charge Density-Functional Tight-Binding Method (SCC-DFTB). J. Chem. Theory Comput. 2011, 7, 931-948; d) Cui, Q.; Elstner, M. Density functional tight binding: values of semi-empirical methods in an ab initio era. Phys. Chem. Chem. Phys. 2014, 16, 14368-14377.
[6] Hourahine, B. et al. DFTB+, a software package for efficient approximate density functional theory based atomistic simulations. J. Chem. Phys. 2020, 152, 124101/1-19.
[7] Larsen, A. H. et al. The atomic simulation environment - a Python library for working with atoms. J. Phys.: Condens. Matter 2017, 29, 273002.
[8] Kubillus, M.; Kubar, T.; Gaus, M.; Rezac, J.; Elstner, M. Parameterization of the DFTB3 Method for Br, Ca, Cl, F, I, K, and Na in Organic and Biological Systems. J. Chem. Theory Comput. 2015, 11, 332-342.
[9] Brandenburg, J. G.; Grimme, S. Accurate Modeling of Organic Molecular Crystals by Dispersion-Corrected Density Functional Tight Binding (DFTB). J. Phys. Chem. Lett. 2014, 5, 1785-1789.
[10] a) Niehaus, T. A.; Elstner, M.; Frauenheim, Th.; Suhai, S. Application of an approximate density-functional method to sulfur containing compounds. J. Mol. Struct.: THEOCHEM 2001, 541, 185-94; b) Elstner, M.; Hobza, P.; Frauenheim, Th.; Suhai, S.; Kaxiras, E. Hydrogen bonding and stacking interactions of nucleic acid base pairs: A density-functional-theory based treatment. J. Chem. Phys. 2001, 114, 5149-55.
[11] Kubar, T.; Bodrog, Z.; Gaus, M.; Köhler, C.; Aradi, B.; Frauenheim, Th.; Elstner, M. Parametrization of the SCC-DFTB Method for Halogens. J. Chem. Theory Comput. 2013, 9, 2939-49.
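
As a usage note for the per-molecule files described above, the sketch below shows one way the excitation energies and oscillator strengths might be read from EXC.DAT files inside one of the tar.gz archives. It rests on two assumptions that should be checked against the actual data: that each data row of EXC.DAT starts with the excitation energy (eV) followed by the oscillator strength, with header and separator lines that do not begin with two numbers, and that each archive member for a molecule has a path ending in "EXC.DAT". The function names and the sample text are hypothetical.

```python
import tarfile


def parse_exc_dat(text):
    """Return (energy_eV, oscillator_strength) pairs from EXC.DAT-style text.

    Assumption: data rows begin with two numeric columns; any line whose
    first two fields are not both numbers (headers, separators) is skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            energy_ev = float(fields[0])
            strength = float(fields[1])
        except ValueError:
            continue
        rows.append((energy_ev, strength))
    return rows


def iter_excitations_in_archive(archive_path):
    """Yield (member name, parsed rows) for every EXC.DAT file in a tar.gz archive.

    Assumption: molecule directories are stored as regular files whose
    paths end in "EXC.DAT".
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith("EXC.DAT"):
                handle = archive.extractfile(member)
                text = handle.read().decode("utf-8", errors="replace")
                yield member.name, parse_exc_dat(text)


# Synthetic sample in the assumed layout (not copied from the dataset)
SAMPLE_EXC_DAT = """
     w [eV]       Osc.Str.         Transition         Weight      KS [eV]    Sym.

 ==========================================================================

      3.102        0.00000000        12   ->    13        1.000       3.102      S
      4.215        0.12345678        11   ->    13        0.987       4.301      S
"""

for energy_ev, strength in parse_exc_dat(SAMPLE_EXC_DAT):
    print(f"{energy_ev:.3f} eV  f = {strength:.8f}")
```

Streaming through the archive with `tarfile` avoids unpacking all molecule directories to disk, which may matter given the uncompressed size of the dataset.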
