Physics-Based Machine Learning Workflows and Large Language Models are Here for Physics, (Bio)Chemistry, and Drug Discovery.
ORAL · Invited
Abstract
In silico methods, such as virtual screening or de novo design based on deep generative models, have emerged as powerful tools to propose hit compounds in drug discovery. However, such methods still suffer from a relatively low success rate, which stems from inadequate study of the binding behavior between ligands and target proteins. We introduce iMiner, a machine learning algorithm that generates novel inhibitor molecules for target proteins by combining deep reinforcement learning (RL) with real-time 3D molecular docking, thereby creating chemical novelty while constraining molecules for shape and molecular compatibility with target active sites. The iMiner algorithm is further distinguished from other generative models by its algorithmic versatility, with capabilities in de novo molecular design, molecule generation via "scaffold hopping", and structure-based design (growth from bound ligand fragments and/or enforced ligand interactions with specific protein sites). Further important attributes of the iMiner algorithm include rapid filtering of undesirable structures (e.g., pan-assay interference compounds, Lipinski Ro5 violators, substructures with toxicity liabilities, synthetically inaccessible compounds) and multiple cross-validation strategies (other docking/scoring functions, automated molecular dynamics simulations to measure pose stability, and evaluation of absolute and relative binding free energies). We also introduce SmileyLlama, obtained by training the Llama 3.1 large language model (LLM) with supervised fine-tuning and direct preference optimization to respond to prompts such as requests to generate molecules with properties of interest to drug development. This shows that an LLM need not be only a chatbot client for chemistry and materials tasks: it can be adapted to act more directly as a chemical language model that generates molecules with user-specified properties.
The iMiner and SmileyLlama algorithms and workflow have been successfully applied to discover inhibitors targeting the SARS-CoV-2 helicase.
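As a rough illustration of the filtering step mentioned in the abstract, the sketch below applies Lipinski's rule of five (molecular weight at most 500 Da, log P at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) to candidates whose descriptors are assumed to be precomputed. This is not the iMiner implementation: the `Descriptors` container, the function names, the `max_violations` tolerance, and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties of one candidate (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # octanol-water partition coefficient, log P
    h_bond_donors: int      # number of hydrogen-bond donors
    h_bond_acceptors: int   # number of hydrogen-bond acceptors


def ro5_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four criteria the candidate breaks."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_ro5(d: Descriptors, max_violations: int = 1) -> bool:
    """Accept a candidate with at most `max_violations` broken criteria.

    Tolerating one violation is a common convention; it is an assumption
    here, not a setting taken from the abstract.
    """
    return ro5_violations(d) <= max_violations


def filter_candidates(candidates: Dict[str, Descriptors],
                      max_violations: int = 1) -> List[str]:
    """Return the names of candidates that pass, in insertion order."""
    return [name for name, d in candidates.items()
            if passes_ro5(d, max_violations)]


if __name__ == "__main__":
    # Illustrative, made-up descriptor values (not real compounds).
    pool = {
        "cand_a": Descriptors(350.0, 3.2, 2, 5),
        "cand_b": Descriptors(720.0, 6.1, 6, 12),
        "cand_c": Descriptors(510.0, 4.0, 1, 6),
    }
    print(filter_candidates(pool))
```

In a generative workflow, a cheap check of this kind would typically run before more expensive steps such as docking, so that clearly undesirable structures are discarded early.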
–
Publication:
Y. E. Wang*, K. O. Sun*, J. Li, X. Guan, O. Zhang, D. Bagni, T. Head-Gordon (2024). PDBBind Optimization to Create a High-Quality Protein-Ligand Binding Dataset for Binding Affinity Prediction. *Equal contribution.
J. M. Cavanagh, K. Sun, A. Gritsevskiy, D. Bagni, T. D. Bannister, T. Head-Gordon (2024). SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration. https://arxiv.org/abs/2409.02231
J. Li, O. Zhang, F. L. Kearns, M. Haghighatlari, C. Parks, R. E. Amaro, T. Head-Gordon (2024). Mining for Potent Inhibitors through Artificial Intelligence and Physics: A Unified Methodology for Ligand Based and Structure Based Drug Design. J. Chem. Inform. Model., published online. https://pubs.acs.org/doi/10.1021/acs.jcim.4c00634