Canonicalizing BigSMILES for Polymer Informatics Using Chemical Intuition and State Machines

Nathan J Rebello; Tzyy-Shyang Lin; Guang-He Lee; Melody A Morris; Bradley D Olsen

ORAL

Abstract

Based on and fully compatible with the extremely popular SMILES line notation, BigSMILES is a user-friendly, human-readable, text-based notation for encoding polymers as random graphs that will be transformative in machine learning applications and polymer informatics. However, a single polymer can have many BigSMILES representations, which makes tasks such as searching for polymers by string difficult. We introduce two algorithms that canonicalize a BigSMILES string into a single unique representation. In the first algorithm, the user writes BigSMILES repeat units according to the monomers from which they are derived, and the output is a human-readable BigSMILES string. The second algorithm depends not on the choice of repeat units but on the connectivity of the polymer: we propose that any linear polymer ensemble is a regular language, as defined in computer science, and that an abstract machine model called a finite automaton can describe that ensemble. Using algorithms from automata theory, an automaton with the fewest states can be derived, which provides a means of canonicalizing a BigSMILES string. These algorithms will enable chemists to search polymer structures in databases using strings. Moreover, by treating polymers as state machines, these stochastic graph representations can also be used in machine learning applications in polymer informatics that connect structure to property.
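
The state-minimization idea behind the second algorithm can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the polymer ensemble has already been translated into a complete deterministic finite automaton (DFA), and the function name `canonical_dfa`, the state labels, and the single symbol `A` (standing in for one hypothetical repeat-unit token) are all illustrative. It uses textbook Moore partition refinement and then renumbers states in breadth-first order, so two automata that accept the same language yield the same tuple.

```python
from collections import deque


def canonical_dfa(states, alphabet, delta, start, accepting):
    """Minimize a complete DFA and return a canonical, hashable description.

    Illustrative sketch only. `delta` maps (state, symbol) -> state and must be
    defined for every reachable state and every symbol in `alphabet`.
    """
    symbols = sorted(alphabet)

    # Step 1: discard states that cannot be reached from the start state.
    reachable = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for a in symbols:
            t = delta[(s, a)]
            if t not in reachable:
                reachable.add(t)
                queue.append(t)

    # Step 2: Moore partition refinement, starting from accepting vs. not.
    block = {s: int(s in accepting) for s in reachable}
    while True:
        ids, refined = {}, {}
        for s in reachable:
            # A state's signature is its own block plus its successors' blocks.
            sig = (block[s],) + tuple(block[delta[(s, a)]] for a in symbols)
            refined[s] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            break  # no block was split, so the partition is stable
        block = refined

    # Step 3: renumber blocks in breadth-first order from the start state so
    # the result does not depend on the original state names.
    order = {block[start]: 0}
    queue = deque([start])
    transitions, final = [], set()
    while queue:
        s = queue.popleft()
        src = order[block[s]]
        if s in accepting:
            final.add(src)
        for a in symbols:
            t = delta[(s, a)]
            if block[t] not in order:
                order[block[t]] = len(order)
                queue.append(t)
            transitions.append((src, a, order[block[t]]))
    return len(order), tuple(transitions), tuple(sorted(final))


# Two descriptions of the same ensemble (any number of A units) ...
one_state = canonical_dfa({"q0"}, ["A"], {("q0", "A"): "q0"}, "q0", {"q0"})
two_state = canonical_dfa(
    {"p0", "p1"}, ["A"],
    {("p0", "A"): "p1", ("p1", "A"): "p0"},
    "p0", {"p0", "p1"},
)
# ... and a different ensemble (only even numbers of A units).
even_only = canonical_dfa(
    {"p0", "p1"}, ["A"],
    {("p0", "A"): "p1", ("p1", "A"): "p0"},
    "p0", {"p0"},
)
```

Here `one_state` and `two_state` compare equal because both automata accept the same strings, while `even_only` differs. Comparing the returned tuples is therefore one way to test whether two representations describe the same ensemble, under the assumptions stated above.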

March 9, 2023, 4:18 PM – March 9, 2023, 4:30 PM

Publication: https://doi.org/10.1021/acspolymersau.2c00009

Presenters

Nathan J Rebello

Massachusetts Institute of Technology

Authors

Nathan J Rebello

Massachusetts Institute of Technology

Tzyy-Shyang Lin

Massachusetts Institute of Technology

Guang-He Lee

Massachusetts Institute of Technology

Melody A Morris

Massachusetts Institute of Technology

Bradley D Olsen

Massachusetts Institute of Technology