GENERALIST: Generative Probabilistic Non-Linear Tensor Factorization Model for Proteins
ORAL
Abstract
Exploring the space of functional protein sequences beyond the naturally occurring ones requires generative models that leverage known natural sequences to learn the correlations between amino acid positions. For long protein sequences with datasets of limited sample size, inference of the protein sequence space can be challenging or infeasible. To address this gap, we present GENERALIST: a generative probabilistic model for protein sequences based on tensor factorization. GENERALIST infers a lower-dimensional latent representation of the natural sequences, which can then be used to generate novel sequences. The generated ensemble conserves several higher-order statistics of the natural alignment. GENERALIST also reproduces ensemble-level statistics, including the distribution of nearest-neighbor distances. Computational assessment of the generated sequences using AlphaFold2 suggests that the ensemble comprises structurally stable sequences. The model complexity of GENERALIST is tunable through the dimension of the latent space, which allows us to control the tradeoff between accuracy and generality. In this way, GENERALIST addresses the limitations of state-of-the-art generative models: its accuracy is robust to both the size of the natural protein sequence alignment and the length of the sequence. Notably, our framework is applicable to all types of categorical data, including nucleotide sequences and binary data such as the presence/absence of genes in genomes, neuronal spikes, etc.
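To make the idea of a latent-factorization generator for aligned categorical sequences concrete, below is a minimal sketch in PyTorch. It assumes a simple formulation: per-sequence latent vectors, per-position factor matrices, a softmax link, and a Gaussian fit over the inferred latents for sampling. The variable names (msa, z, W, b) and the dimensions are illustrative placeholders, and this is not the authors' published GENERALIST implementation.

```python
# Hedged sketch of a low-rank probabilistic generator for an aligned set of
# categorical sequences. All modeling choices here (softmax likelihood,
# Gaussian sampling of latents) are assumptions for illustration only.
import torch
import torch.nn.functional as F

N, L, Q, D = 500, 60, 21, 8            # sequences, alignment length, alphabet size, latent dim
msa = torch.randint(0, Q, (N, L))      # placeholder integer-encoded alignment

z = torch.randn(N, D, requires_grad=True)              # per-sequence latent coordinates
W = torch.zeros(L, D, Q).normal_(0, 0.01).requires_grad_()  # per-position factor matrices
b = torch.zeros(L, Q, requires_grad=True)              # per-position biases

opt = torch.optim.Adam([z, W, b], lr=0.05)
for step in range(200):
    # Low-rank logits for every sequence and position, then maximum likelihood.
    logits = torch.einsum('nd,ldq->nlq', z, W) + b
    loss = F.cross_entropy(logits.reshape(-1, Q), msa.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generate novel sequences: sample latents from a Gaussian fit to the inferred z,
# then draw residues from the resulting per-position categorical distributions.
with torch.no_grad():
    mu, sigma = z.mean(0), z.std(0)
    z_new = mu + sigma * torch.randn(100, D)
    probs = torch.softmax(torch.einsum('nd,ldq->nlq', z_new, W) + b, dim=-1)
    samples = torch.distributions.Categorical(probs).sample()  # (100, L) sampled sequences
```

The latent dimension D plays the role described in the abstract: increasing it makes the model more expressive (higher accuracy on the training alignment), while decreasing it favors generality.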
Presenters
- Hoda Akl, University of Florida
Authors
- Hoda Akl, University of Florida
- Brooke Emison, University of Florida
- Xiaochuan Zhao, University of Florida
- Purushottam Dixit, University of Florida