Canonicalizing BigSMILES for Polymer Informatics Using Chemical Intuition and State Machines
ORAL
Abstract
Based on and fully compatible with the widely used SMILES line notation, BigSMILES is a user-friendly, human-readable, text-based notation that encodes polymers as random graphs, with applications in machine learning and polymer informatics. However, a single polymer can have many BigSMILES representations, which makes tasks such as searching for polymers by string difficult. We introduce two algorithms that canonicalize BigSMILES into a single unique string representation. In the first algorithm, the user writes BigSMILES repeat units according to the monomers from which they are derived, and the output is a human-readable BigSMILES string. The second algorithm depends not on the choice of repeat units but on the connectivity of the polymer: we propose that any linear polymer ensemble is a regular language, as defined in computer science, and that an abstract machine model called a finite automaton can describe that ensemble. Using algorithms from automata theory, an automaton with the fewest states can be derived, which provides a means of canonicalizing a BigSMILES string. These algorithms will enable chemists to search for polymer structures in databases using strings. Moreover, by treating polymers as state machines, these stochastic graph representations can also be used in machine learning applications in polymer informatics that connect structure to property.
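The automaton idea in the second algorithm can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear polymer ensemble has already been written as a deterministic finite automaton whose symbols are repeat-unit tokens, and it uses textbook Moore-style partition refinement as a stand-in for whichever minimization algorithm the published work uses. The function name `canonical_dfa`, the token "A", and the example automata are all hypothetical.

```python
from collections import deque

def canonical_dfa(alphabet, transitions, start, accepting):
    """Minimize a DFA and relabel its states in a fixed order.

    Illustrative sketch only. transitions maps (state, symbol) -> state;
    missing entries lead to an implicit dead state.
    """
    symbols = sorted(alphabet)
    dead = object()  # implicit sink for undefined transitions

    def step(state, symbol):
        if state is dead:
            return dead
        return transitions.get((state, symbol), dead)

    # 1. Collect the states reachable from the start state.
    reachable, seen, queue = [start], {start}, deque([start])
    while queue:
        s = queue.popleft()
        for a in symbols:
            t = step(s, a)
            if t not in seen:
                seen.add(t)
                reachable.append(t)
                queue.append(t)

    # 2. Moore refinement: split blocks until no block changes.
    block = {s: int(s in accepting) for s in reachable}
    while True:
        ids = {}
        new_block = {}
        for s in reachable:
            sig = (block[s],) + tuple(block[step(s, a)] for a in symbols)
            new_block[s] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            break  # partition is stable
        block = new_block

    # 3. Number blocks by breadth-first discovery order from the start block.
    rep = {block[s]: s for s in reachable}  # one representative per block
    order = {block[start]: 0}
    queue = deque([block[start]])
    table = []
    while queue:
        b = queue.popleft()
        for a in symbols:
            t = block[step(rep[b], a)]
            if t not in order:
                order[t] = len(order)
                queue.append(t)
            table.append((order[b], a, order[t]))
    accept = sorted({order[block[s]] for s in reachable if s in accepting})
    return len(order), tuple(accept), tuple(table)

# Two hypothetical automata for the same ensemble: one or more "A" units.
compact = canonical_dfa({"A"}, {(0, "A"): 1, (1, "A"): 1}, 0, {1})
redundant = canonical_dfa(
    {"A"}, {(0, "A"): 1, (1, "A"): 2, (2, "A"): 1}, 0, {1, 2}
)
# A different ensemble: an even, nonzero number of "A" units.
even_only = canonical_dfa(
    {"A"}, {(0, "A"): 1, (1, "A"): 2, (2, "A"): 1}, 0, {2}
)
```

Here `compact` and `redundant` describe the same ensemble with different numbers of states and reduce to the same canonical tuple, while `even_only` describes a different ensemble and does not. Comparing such tuples is one possible way to test whether two descriptions are equivalent, provided both use the same alphabet of tokens.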
Publication: https://doi.org/10.1021/acspolymersau.2c00009
Presenters
- Nathan J Rebello, Massachusetts Institute of Technology
Authors
- Nathan J Rebello, Massachusetts Institute of Technology
- Tzyy-Shyang Lin, Massachusetts Institute of Technology
- Guang-He Lee, Massachusetts Institute of Technology
- Melody A Morris, Massachusetts Institute of Technology
- Bradley D Olsen, Massachusetts Institute of Technology