
Field theories for density estimation on sequence space

ORAL

Abstract

Density estimation on sequence space is a fundamental problem in machine learning that is of great importance in computational biology. Due to the discrete nature and the high dimensionality of sequence space, how best to estimate such densities from a sample of sequences remains unclear. We present a novel solution to this problem based on Bayesian field theory and spectral graph theory. Our method first identifies a one-parameter family of densities with the empirical frequency at one extreme and a maximum entropy (MaxEnt) estimate at the other. Notably, all densities in this family exactly match the marginal statistics that constrain the MaxEnt distribution. The optimal density within this family is then determined by cross validation. We demonstrate this method in two diverse biological contexts: human 5′ splice sites (4^9 possible RNA sequences) and karyotypes of human cancer (2^22 possible karyotypes). In both cases, our method yields density estimates that have richer structure than the corresponding MaxEnt estimates, better predict held-out test data, and enable visualizations that illuminate underlying biological mechanisms. Our method is thus an effective tool for analyzing biological sequence data, as well as other data types of a discrete combinatorial nature.
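The idea of a one-parameter family tuned by cross validation can be illustrated on a toy sequence space. The sketch below is a simplified stand-in, not the authors' implementation: it assumes an action of the form S(phi) = (a/2) phi^T L phi + sum_i n_i phi_i + N sum_i exp(-phi_i), with density Q = exp(-phi), and it uses the ordinary graph Laplacian L of the Hamming graph. Under that assumption the large-`a` limit is the uniform distribution (a trivial MaxEnt estimate with no marginal constraints), while the small-`a` limit approaches the empirical frequencies. The function names, the toy counts, and the grid of `a` values are all hypothetical.

```python
import itertools
import numpy as np

def hamming_laplacian(length, alphabet="01"):
    """Graph Laplacian of sequence space; edges join sequences one mutation apart."""
    seqs = ["".join(s) for s in itertools.product(alphabet, repeat=length)]
    n = len(seqs)
    adj = np.zeros((n, n))
    for i, s in enumerate(seqs):
        for j, t in enumerate(seqs):
            if sum(x != y for x, y in zip(s, t)) == 1:
                adj[i, j] = 1.0
    return seqs, np.diag(adj.sum(axis=1)) - adj

def action(phi, a, lap, counts):
    """Assumed action: smoothness penalty plus a likelihood-like data term."""
    with np.errstate(over="ignore", invalid="ignore"):
        return (0.5 * a * (phi @ lap @ phi) + counts @ phi
                + counts.sum() * np.exp(-phi).sum())

def fit_density(counts, lap, a, tol=1e-8, max_iter=200):
    """Minimize the convex action with damped Newton steps; return Q = exp(-phi)."""
    n_obs = counts.sum()
    phi = np.full(len(counts), np.log(len(counts)))  # start at the uniform density
    for _ in range(max_iter):
        q = np.exp(-phi)
        grad = a * (lap @ phi) + counts - n_obs * q
        if np.abs(grad).max() < tol * n_obs:
            break
        hess = a * lap + n_obs * np.diag(q)  # positive definite because q > 0
        step = np.linalg.solve(hess, -grad)
        s0, slope, t = action(phi, a, lap, counts), grad @ step, 1.0
        # Backtracking line search (Armijo condition); also rejects inf/nan trials.
        while t > 1e-12 and not (action(phi + t * step, a, lap, counts)
                                 <= s0 + 1e-4 * t * slope):
            t *= 0.5
        phi = phi + t * step
    q = np.exp(-phi)
    return q / q.sum()  # already sums to 1 at the optimum; renormalize for safety

# Hypothetical toy data: counts for the 8 binary sequences of length 3.
seqs, lap = hamming_laplacian(3)
train = np.array([12, 5, 6, 1, 4, 0, 1, 1], dtype=float)
test = np.array([6, 3, 2, 1, 2, 1, 0, 0], dtype=float)

# Choose the penalty strength a by held-out log-likelihood.
grid = np.logspace(-2, 3, 11)
scores = [test @ np.log(fit_density(train, lap, a)) for a in grid]
best_a = grid[int(np.argmax(scores))]
print("selected a:", best_a)
print(dict(zip(seqs, np.round(fit_density(train, lap, best_a), 3))))
```

A single train/test split stands in for cross validation here to keep the sketch short; a k-fold loop over the same grid would follow the same pattern. Matching higher-order marginal statistics, as the abstract describes, would require a different smoothness operator than the plain Laplacian assumed in this sketch.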

Presenters

  • Wei-Chia Chen

    Cold Spring Harbor Lab

Authors

  • Wei-Chia Chen

    Cold Spring Harbor Lab

  • Juannan Zhou

    Cold Spring Harbor Lab

  • Jason M Sheltzer

    Cold Spring Harbor Lab

  • Justin Block Kinney

    Cold Spring Harbor Lab

  • David M McCandlish

    Cold Spring Harbor Lab