GENERALIST: Generative Probabilistic Non-Linear Tensor Factorization Model for Proteins

Hoda Akl; Brooke Emison; Xiaochuan Zhao; Purushottam Dixit

ORAL

Abstract

Exploring the space of functional protein sequences beyond those that occur naturally requires generative models that use known natural sequences to learn the correlations between amino acid positions. For large protein sequences with datasets of limited sample size, inference of the protein sequence space can be challenging or infeasible. To address this gap, we present GENERALIST, a generative probabilistic model for protein sequences based on tensor factorization. GENERALIST infers a lower-dimensional latent representation of the natural sequences, which can then be used to generate novel sequences. The generated ensemble conserves several higher-order statistics of the natural alignment. GENERALIST also reproduces the statistics of the sequence ensemble, including the distribution of nearest-neighbor distances. Computational assessment of the sequence ensemble using AlphaFold2 suggests that the ensemble comprises structurally stable sequences. The model complexity of GENERALIST is tunable through the dimension of the latent space, which allows us to control the tradeoff between accuracy and generality. In this way, GENERALIST addresses limitations of state-of-the-art generative models: its accuracy is robust to the size of the natural protein sequence alignment and to the length of the sequence. Notably, our framework is applicable to all types of categorical data, including nucleotide sequences and binary data such as the presence/absence of genes in genomes and neuronal spikes.
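The idea of a tunable latent dimension can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a softmax-parameterized low-rank factorization of a one-hot encoded alignment, fitted by plain gradient ascent on the categorical log-likelihood, and it runs on a synthetic toy alignment. All names (`init_params`, `fit`, `sample_sequences`, `nearest_neighbor_hamming`) and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis (the categorical states).
    shifted = logits - logits.max(axis=-1, keepdims=True)
    expd = np.exp(shifted)
    return expd / expd.sum(axis=-1, keepdims=True)

def site_probabilities(z, theta):
    # Assumed model: P[n, l, a] = softmax_a(sum_k z[n, k] * theta[k, l, a]).
    return softmax(np.einsum("nk,kla->nla", z, theta))

def log_likelihood(onehot, z, theta):
    probs = site_probabilities(z, theta)
    return float((onehot * np.log(probs + 1e-12)).sum())

def init_params(n_seq, length, n_states, latent_dim, rng):
    # Small random start; latent_dim is the knob that sets model complexity.
    z = 0.1 * rng.standard_normal((n_seq, latent_dim))
    theta = 0.1 * rng.standard_normal((latent_dim, length, n_states))
    return z, theta

def fit(onehot, z, theta, steps=400, lr=0.1):
    # Plain gradient ascent (an assumption, chosen for brevity).
    z, theta = z.copy(), theta.copy()
    n_seq, length, _ = onehot.shape
    for _ in range(steps):
        resid = onehot - site_probabilities(z, theta)  # d(logL)/d(logits)
        grad_z = np.einsum("nla,kla->nk", resid, theta)
        grad_theta = np.einsum("nk,nla->kla", z, resid)
        z += lr * grad_z / length
        theta += lr * grad_theta / n_seq
    return z, theta

def sample_sequences(z, theta, rng):
    # Draw one state per position by inverse-CDF sampling.
    probs = site_probabilities(z, theta)
    cdf = probs.cumsum(axis=-1)
    u = rng.random(probs.shape[:2] + (1,))
    return np.minimum((u > cdf).sum(axis=-1), probs.shape[-1] - 1)

def nearest_neighbor_hamming(seqs):
    # Hamming distance from each sequence to its closest other sequence.
    dists = (seqs[:, None, :] != seqs[None, :, :]).sum(axis=-1)
    np.fill_diagonal(dists, seqs.shape[1] + 1)  # exclude self-matches
    return dists.min(axis=1)

# Synthetic toy alignment: two consensus "families" with 10% random mutations.
rng = np.random.default_rng(0)
n_seq, length, n_states, latent_dim = 40, 12, 4, 2
consensus = rng.integers(0, n_states, size=(2, length))
alignment = consensus[rng.integers(0, 2, size=n_seq)]
mutate = rng.random(alignment.shape) < 0.1
alignment = np.where(mutate, rng.integers(0, n_states, size=alignment.shape), alignment)
onehot = np.eye(n_states)[alignment]

z0, theta0 = init_params(n_seq, length, n_states, latent_dim, rng)
ll_before = log_likelihood(onehot, z0, theta0)
z, theta = fit(onehot, z0, theta0)
ll_after = log_likelihood(onehot, z, theta)

generated = sample_sequences(z, theta, rng)
nn_natural = nearest_neighbor_hamming(alignment)
nn_generated = nearest_neighbor_hamming(generated)
print("log-likelihood before/after:", ll_before, ll_after)
print("mean nearest-neighbor distance (natural, generated):",
      nn_natural.mean(), nn_generated.mean())
```

In this sketch, raising `latent_dim` lets the model fit the toy alignment more closely, while lowering it forces generated sequences toward broader, shared patterns. That is one way to read the accuracy versus generality tradeoff described in the abstract. Comparing `nn_natural` with `nn_generated` mirrors, in a simplified form, the nearest-neighbor distance check that the abstract mentions.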

March 9, 2023, 3:06 PM – March 9, 2023, 3:18 PM

Presenters

Hoda Akl

University of Florida

Authors

Hoda Akl

University of Florida

Brooke Emison

University of Florida

Xiaochuan Zhao

University of Florida

Purushottam Dixit

University of Florida