TY - JOUR
T1 - Automating Genetic Algorithm Mutations for Molecules Using a Masked Language Model
AU - Blanchard, Andrew E.
AU - Shekar, Mayanka Chandra
AU - Gao, Shang
AU - Gounley, John
AU - Lyngaas, Isaac
AU - Glaser, Jens
AU - Bhowmik, Debsindhu
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022/8/1
Y1 - 2022/8/1
N2 - Inspired by the evolution of biological systems, genetic algorithms have been applied to generate solutions for optimization problems in a variety of scientific and engineering disciplines. For a given problem, a suitable genome representation must be defined along with a mutation operator to generate subsequent generations. Unlike natural systems, which display a variety of complex rearrangements (e.g., mobile genetic elements), mutation for genetic algorithms commonly uses only random pointwise changes. Furthermore, generalizing beyond pointwise mutations poses a key difficulty, as useful genome rearrangements depend on the representation and problem domain. To move beyond the limitations of manually defined pointwise changes, here we propose the use of techniques from masked language models to automatically generate mutations. As a first step, common subsequences within a given population are used to generate a vocabulary. The vocabulary is then used to tokenize each genome. A masked language model is trained on the tokenized data to generate possible rearrangements (i.e., mutations). To illustrate the proposed strategy, we use string representations of molecules and a genetic algorithm to optimize for drug-likeness and synthesizability. Our results show that moving beyond random pointwise mutations accelerates genetic algorithm optimization.
AB - Inspired by the evolution of biological systems, genetic algorithms have been applied to generate solutions for optimization problems in a variety of scientific and engineering disciplines. For a given problem, a suitable genome representation must be defined along with a mutation operator to generate subsequent generations. Unlike natural systems, which display a variety of complex rearrangements (e.g., mobile genetic elements), mutation for genetic algorithms commonly uses only random pointwise changes. Furthermore, generalizing beyond pointwise mutations poses a key difficulty, as useful genome rearrangements depend on the representation and problem domain. To move beyond the limitations of manually defined pointwise changes, here we propose the use of techniques from masked language models to automatically generate mutations. As a first step, common subsequences within a given population are used to generate a vocabulary. The vocabulary is then used to tokenize each genome. A masked language model is trained on the tokenized data to generate possible rearrangements (i.e., mutations). To illustrate the proposed strategy, we use string representations of molecules and a genetic algorithm to optimize for drug-likeness and synthesizability. Our results show that moving beyond random pointwise mutations accelerates genetic algorithm optimization.
KW - Artificial intelligence
KW - bioinformatics
KW - genetic algorithms
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85123384459&partnerID=8YFLogxK
U2 - 10.1109/TEVC.2022.3144045
DO - 10.1109/TEVC.2022.3144045
M3 - Article
AN - SCOPUS:85123384459
SN - 1089-778X
VL - 26
SP - 793
EP - 799
JO - IEEE Transactions on Evolutionary Computation
JF - IEEE Transactions on Evolutionary Computation
IS - 4
ER -