GENERALIST: Generative Probabilistic Non-Linear Tensor Factorization Model for Proteins

ORAL

Abstract

Exploring the space of functional protein sequences beyond those that occur naturally requires generative models that leverage known natural sequences to learn the correlations between amino acid positions. For long protein sequences with datasets of limited sample size, inferring the protein sequence space can be challenging or infeasible. To address this gap, we present GENERALIST: a generative probabilistic model for protein sequences based on tensor factorization. GENERALIST infers a lower-dimensional latent representation of the natural sequences, which can then be used to generate novel sequences. The generated ensemble conserves several higher-order statistics of the natural alignment. GENERALIST also reproduces the statistics of the sequence ensemble, including the distribution of nearest-neighbor distances. Computational assessment of the sequence ensemble using AlphaFold2 suggests that the ensemble comprises structurally stable sequences. The model complexity of GENERALIST is tunable through the dimension of the latent space, which allows us to control the tradeoff between accuracy and generality. In this way, GENERALIST addresses the limitations of state-of-the-art generative models: its accuracy is robust to the size of the natural protein sequence alignment and to the length of the sequence. Notably, our framework is applicable to all types of categorical data, including nucleotide sequences and binary data such as the presence/absence of genes in genomes and neuronal spikes.
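
To illustrate the general idea of a latent-space generative model for categorical sequences, here is a minimal NumPy sketch. It is not the GENERALIST implementation: the abstract does not give the functional form, so the softmax over a rank-K product of latent coordinates `z` and a factor tensor `theta` is an assumption, as are all function names and sizes. The parameters are drawn at random rather than inferred from an alignment, so the sketch only shows how sequences could be sampled from such a model and how nearest-neighbor Hamming distances could be computed for the resulting ensemble.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def position_probabilities(z, theta):
    """Per-sequence, per-position category probabilities.

    z:     (N, K) latent coordinates, one row per sequence (assumed form)
    theta: (K, L, Q) factor tensor over K latent dims, L positions, Q categories
    Returns an (N, L, Q) tensor whose last axis sums to 1.
    """
    logits = np.einsum("nk,klq->nlq", z, theta)
    return softmax(logits, axis=-1)

def sample_sequences(probs, rng):
    """Draw one category index per (sequence, position) by inverse-CDF sampling."""
    cdf = probs.cumsum(axis=-1)
    u = rng.random(probs.shape[:-1] + (1,))
    idx = (u > cdf).sum(axis=-1)
    # Guard against floating-point round-off in the last CDF entry.
    return np.minimum(idx, probs.shape[-1] - 1)

def nearest_neighbor_hamming(seqs):
    """Hamming distance from each sequence to its closest other sequence."""
    dist = (seqs[:, None, :] != seqs[None, :, :]).sum(axis=-1).astype(float)
    np.fill_diagonal(dist, np.inf)  # exclude self-distances
    return dist.min(axis=1)

# Illustrative sizes only: N sequences, L positions, Q amino acids, K latent dims.
rng = np.random.default_rng(0)
N, L, Q, K = 50, 12, 20, 3
z = rng.normal(size=(N, K))          # random stand-in for inferred latents
theta = rng.normal(size=(K, L, Q))   # random stand-in for inferred factors

probs = position_probabilities(z, theta)
seqs = sample_sequences(probs, rng)
nn_dist = nearest_neighbor_hamming(seqs)
```

In this sketch, `K` plays the role of the latent dimension mentioned above: a larger `K` allows the probability tensor to vary more freely across sequences, while a smaller `K` constrains it more strongly.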

Presenters

  • Hoda Akl

    University of Florida

Authors

  • Hoda Akl

    University of Florida

  • Brooke Emison

    University of Florida

  • Xiaochuan Zhao

    University of Florida

  • Purushottam Dixit

    University of Florida