Optimized Normalizing Flow for Molecular Discovery
ORAL
Abstract
Conventional materials design and discovery relies on human domain knowledge to propose candidate materials, which must then be synthesized and tested. The entire process is costly and time-consuming, which limits the throughput and diversity of the materials explored. Machine learning with generative models offers a promising solution. In this study, we built a normalizing flow (NF) optimized for quick and accurate discovery of novel small molecules. First, molecules were randomly selected from the QM7, QM8, QM9 and ChEMBL molecular datasets, and their SMILES representations were converted to SELFIES representations. Sampling from more than one dataset increases the variety of molecules in the training data and thereby improves the generalizability of the model. Next, molecules were generated from an NF whose hyperparameters, including the composition of the flow (such as the number of hidden units in the autoregressive network and the number of NF layers), the learning rate, and the batch size, were optimized by multi-objective Bayesian optimization (MOBO) to minimize the mean divergences (Kullback-Leibler, Wasserstein) between the generated and training molecules. Molecules were then re-sampled from the target distribution of the NF with the optimal hyperparameters and screened to eliminate those with low similarity to the training molecules. The chemical validity, novelty, uniqueness, and internal diversity of the generated molecules were verified, and the distributions of their synthetic accessibility score (SA score) and synthetic complexity score (SCScore) were compared against those of the training molecules. Finally, the NF model was benchmarked on MOSES datasets to evaluate its quality.
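The evaluation quantities named in the abstract can be illustrated with a minimal sketch. The abstract does not state how they were computed, so everything below is an assumption: the function names are hypothetical, molecules are treated as already-canonicalized strings, uniqueness and novelty follow their common set-based definitions, and the two divergences are computed on a hypothetical scalar descriptor of each molecule rather than on whatever representation the authors actually used.

```python
import math
from collections import Counter


def uniqueness(generated):
    """Fraction of generated strings that are distinct."""
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings that do not occur in training."""
    distinct = set(generated)
    return len(distinct - set(training)) / len(distinct)


def kl_divergence(p_samples, q_samples, eps=1e-9):
    """KL(P || Q) between empirical distributions of discrete values.

    eps is additive smoothing so values unseen in Q do not give log(0).
    """
    support = set(p_samples) | set(q_samples)
    p_counts, q_counts = Counter(p_samples), Counter(q_samples)
    p_norm = len(p_samples) + eps * len(support)
    q_norm = len(q_samples) + eps * len(support)
    total = 0.0
    for value in support:
        p = (p_counts[value] + eps) / p_norm
        q = (q_counts[value] + eps) / q_norm
        total += p * math.log(p / q)
    return total


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Toy data: strings stand in for canonicalized molecules, integers for a
# hypothetical scalar descriptor (for example, heavy-atom count).
train_mols = ["CCO", "CCC"]
gen_mols = ["CCO", "CCO", "CCN", "c1ccccc1"]
train_desc = [3, 3, 4, 5]
gen_desc = [3, 4, 4, 6]

print(uniqueness(gen_mols), novelty(gen_mols, train_mols))
print(kl_divergence(gen_desc, train_desc), wasserstein_1d(gen_desc, train_desc))
```

In a multi-objective setting, the two divergence values would be returned as separate objectives for the optimizer to trade off rather than being summed. Chemical validity and internal diversity are omitted here because they require a cheminformatics toolkit.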
Presenters
Jarvis Loh
Institute of High Performance Computing
Authors
Jarvis Loh
Institute of High Performance Computing