Data Augmentation and Pre-training for Template-Based Retrosynthetic Prediction
ORAL
Abstract
A key step in computer-aided synthesis planning (CASP) is the prioritization of candidate molecular transformations for retrosynthetic analysis. Recent methods that achieve state-of-the-art accuracy use machine learning (ML) models as recommendation engines to rank reaction templates extracted from databases of recorded reactions. However, data scarcity limits the ability of ML models to recommend rare, often highly desired, transformations. In this work we discuss the augmentation of open-access reaction databases with synthetically generated molecular transformations to teach neural networks generalized template applicability. We use this as a pre-training strategy, followed by fine-tuning of the model parameters on true, recorded reactions, in order to increase the diversity of suggested retrosynthetic transformations. While previous methods have focused on learning a one-to-one mapping from featurized molecular inputs to a single template transformation, pre-training on general template applicability allows these new models to learn a one-to-many mapping to multiple templates. The implications of performing data augmentation and pre-training on datasets of different sizes are discussed, as well as the changes in performance for rare reaction templates.
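The difference between the one-to-one and one-to-many training targets can be illustrated with a minimal sketch. It is not the authors' implementation: the template names, the fragment strings, and the substring test are all hypothetical stand-ins chosen to keep the example dependency-free. A real system would match reaction-template patterns against the molecular graph with a cheminformatics toolkit.

```python
from typing import Dict, List

# Hypothetical toy templates: name -> SMILES fragment the product must contain.
# ASSUMPTION: substring matching on SMILES is only a crude stand-in for real
# subgraph matching of reaction templates; it is used here for illustration.
TEMPLATES: Dict[str, str] = {
    "amide_disconnection": "C(=O)N",
    "ester_disconnection": "C(=O)O",
    "bromide_coupling": "Br",
}
# Fixed template order so every label vector uses the same indexing.
TEMPLATE_NAMES: List[str] = sorted(TEMPLATES)


def applicable_templates(product: str) -> List[str]:
    """Names of all toy templates whose fragment occurs in the product string."""
    return [name for name in TEMPLATE_NAMES if TEMPLATES[name] in product]


def multi_hot(product: str) -> List[int]:
    """One-to-many target: 1 for every template that is applicable."""
    matches = set(applicable_templates(product))
    return [1 if name in matches else 0 for name in TEMPLATE_NAMES]


def one_hot(recorded_template: str) -> List[int]:
    """One-to-one target: 1 only for the template of the recorded reaction."""
    if recorded_template not in TEMPLATES:
        raise KeyError(f"unknown template: {recorded_template}")
    return [1 if name == recorded_template else 0 for name in TEMPLATE_NAMES]


if __name__ == "__main__":
    product = "CC(=O)NCCC(=O)OC"  # toy product with an amide and an ester
    print(TEMPLATE_NAMES)
    print("pre-training target:", multi_hot(product))
    print("fine-tuning target: ", one_hot("amide_disconnection"))
```

Under this reading, multi-hot vectors of the kind returned by `multi_hot` would serve as pre-training labels, and one-hot vectors from recorded reactions would serve as fine-tuning labels. How the authors actually generate and encode the augmented data is not specified in the abstract.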
Presenters
-
Mike Fortunato
Department of Chemical Engineering, Massachusetts Institute of Technology
Authors
-
Mike Fortunato
Department of Chemical Engineering, Massachusetts Institute of Technology
-
Connor Coley
Department of Chemical Engineering, Massachusetts Institute of Technology
-
Brian Barnes
Army Research Laboratory, Detonation Science and Modeling Branch, CCDC Army Research Laboratory, US Army Rsch Lab - Aberdeen
-
Klavs Jensen
Department of Chemical Engineering, Massachusetts Institute of Technology