A Superstatistical Variational Model for Categorical Data: Applications to Protein Sequence Variation

Hoda Akl; Purushottam Dixit; Xiaochuan Zhao

ORAL

Abstract

Understanding the constraints on amino acid variation in protein sequences within a protein family is crucial to understanding the evolutionary and biophysical forces that dictate protein structure and function. However, the available sequences of naturally occurring proteins likely cover only a limited region of the vast space of functionally viable sequences. To paint a better picture of what makes an amino acid sequence a functional protein, we need generative models that can sample de novo protein sequences. To that end, we present a variational, data-driven model rooted in superstatistics that leverages the information present in natural sequences and can thus sample new protein sequences from the inferred latent space. The generated sequences differ significantly from known sequences in the family but accurately reproduce several lower-order statistics (frequencies, correlations, etc.). Moreover, probabilities of point mutations are predictive of fitness effects, with predicted high-probability mutations corresponding to near-neutral fitness costs. The formalism generalizes to other categorical data (neuronal firing, graph edges, etc.) and thus has a wide range of applications.

March 15, 2022, 6:12 PM – March 15, 2022, 6:24 PM

Presenters

Hoda Akl

University of Florida

Authors

Hoda Akl

University of Florida

Purushottam Dixit

University of Florida

Xiaochuan Zhao

University of Florida