A Superstatistical Variational Model for Categorical Data: Applications to Protein Sequence Variation
ORAL
Abstract
Characterizing the constraints on amino acid variation within a protein family is crucial to understanding the evolutionary and biophysical forces that dictate protein structure and function. However, the available sequences of naturally occurring proteins likely cover only a limited region of the vast space of functionally viable sequences. To paint a better picture of what makes an amino acid sequence a functional protein, we need generative models that can sample de novo protein sequences. To that end, we present a variational, data-driven model rooted in superstatistics that leverages the information already present in natural sequences and can therefore sample new protein sequences from the inferred latent space. The generated sequences differ significantly from known sequences in the family but accurately reproduce several lower-order statistics (frequencies, correlations, etc.). Moreover, the model's probabilities of point mutations are predictive of fitness effects, with mutations predicted to have high probability corresponding to near-neutral fitness costs. The formalism generalizes to other categorical data (neuronal firing, graph edges, etc.) and thus has a wide range of applications.
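To make "lower-order statistics" concrete, the following is a minimal sketch of how single-site frequencies and pairwise (connected) correlations could be computed from integer-encoded categorical sequences, so that the same quantities can be compared between natural and generated samples. It is not the model described above; the function names, the integer encoding, and the toy alignment are hypothetical and chosen only for illustration.

```python
import numpy as np

def site_frequencies(seqs, n_states):
    """Return an (L, q) array: frequency of each category at each position."""
    seqs = np.asarray(seqs)
    one_hot = np.eye(n_states)[seqs]          # shape (N, L, q)
    return one_hot.mean(axis=0)

def pairwise_covariances(seqs, n_states):
    """Return an (L, q, L, q) array of f_ij(a, b) - f_i(a) * f_j(b)."""
    seqs = np.asarray(seqs)
    one_hot = np.eye(n_states)[seqs]          # shape (N, L, q)
    n = one_hot.shape[0]
    f_i = one_hot.mean(axis=0)                # single-site frequencies
    f_ij = np.einsum("nia,njb->iajb", one_hot, one_hot) / n
    return f_ij - np.einsum("ia,jb->iajb", f_i, f_i)

# Hypothetical toy alignment: 4 sequences, 3 positions, 3 categories (0, 1, 2).
toy = np.array([[0, 1, 2],
                [0, 1, 1],
                [2, 1, 0],
                [0, 2, 2]])

freqs = site_frequencies(toy, 3)
cov = pairwise_covariances(toy, 3)
```

Applying the same two functions to a set of natural sequences and to a set of generated sequences, then comparing the resulting arrays entry by entry, is one simple way to check the kind of agreement the abstract describes.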
Presenters
- Hoda Akl, University of Florida
Authors
- Hoda Akl, University of Florida
- Purushottam Dixit, University of Florida
- Xiaochuan Zhao, University of Florida