Finding the Function-Determining Subset of Amino Acids in Protein Sequence Data

ORAL

Abstract

Energy-based models (EBMs) fit to aligned sequences of a protein family have demonstrated the ability to generate novel functional protein sequences. This suggests that sequence-level statistics encode the salient features that underpin the structure and function of a protein. Understanding how well EBMs can model such features gives insight into their ability to sample new sequences and, consequently, into the underlying biology. Specifically, we consider the ability of EBMs to capture protein sectors: the roughly 10 to 20 percent of sequence positions that correlate strongly with biological function. To this end, we fit pairwise models, restricted Boltzmann machines (RBMs), and hybrid semi-restricted Boltzmann machines (sRBMs) to synthetic sequences drawn from a minimal model endowed with a notion of sectors. The aim of incorporating an RBM is to model the sector. EBMs fit to the synthetic data are benchmarked by directly relating how well they model the sector to their generative performance. These benchmarks inform our understanding of the generative performance of EBMs fit to real data, which is tested directly in lab experiments that probe the functionality of sampled sequences.
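
For orientation, the three model classes named above are commonly written with the energy functions sketched below. This is the standard textbook form, not necessarily the parameterization used by the authors; the abstract does not give one. The symbols are assumptions introduced here for illustration: s = (s_1, ..., s_L) is an aligned sequence, g_i are single-site fields, J_ij are pairwise couplings, z_mu are hidden units with potentials U_mu, and w_{i mu} are visible-hidden weights.

```latex
% Pairwise (Potts-type) model: fields plus direct couplings between positions
E_{\mathrm{pair}}(s) = -\sum_{i} g_i(s_i) - \sum_{i<j} J_{ij}(s_i, s_j)

% RBM: positions interact only through hidden units z_\mu
E_{\mathrm{RBM}}(s, z) = -\sum_{i} g_i(s_i) + \sum_{\mu} U_\mu(z_\mu)
                         - \sum_{i,\mu} w_{i\mu}(s_i)\, z_\mu

% sRBM (hybrid): RBM energy plus direct pairwise couplings
E_{\mathrm{sRBM}}(s, z) = E_{\mathrm{RBM}}(s, z) - \sum_{i<j} J_{ij}(s_i, s_j)

% In each case sequences are sampled from the Boltzmann distribution,
% with hidden units summed or integrated out where present
P(s) \propto \int \mathrm{d}z \; e^{-E(s, z)}
```

In this form, the hybrid model lets hidden units carry collective structure such as a sector while the pairwise terms carry the remaining pairwise correlations, which is consistent with the stated aim of incorporating an RBM.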

Presenters

  • Peter Fields

    University of Chicago

Authors

  • Peter Fields

    University of Chicago

  • Vudtiwat Ngampruetikorn

    The Graduate Center, City University of New York (CUNY)

  • Rama Ranganathan

    University of Chicago

  • David J Schwab

    The Graduate Center, CUNY

  • Stephanie E Palmer

    University of Chicago