Field theories for density estimation on sequence space
ORAL
Abstract
Density estimation on sequence space is a fundamental problem in machine learning that is of great importance in computational biology. Due to the discrete nature and the high dimensionality of sequence space, how best to estimate such densities from a sample of sequences remains unclear. We present a novel solution to this problem based on Bayesian field theory and spectral graph theory. Our method first identifies a one-parameter family of densities with the empirical frequency at one extreme and a maximum entropy (MaxEnt) estimate at the other. Notably, all densities in this family exactly match the marginal statistics that constrain the MaxEnt distribution. The optimal density within this family is then determined by cross validation. We demonstrate this method in two diverse biological contexts, human 5’ splice sites (4^9 possible RNA sequences) and karyotypes of human cancer (2^22 possible karyotypes). In both cases, our method yields density estimates that have richer structure than the corresponding MaxEnt estimates, better predict held-out test data, and enable visualizations that illuminate underlying biological mechanisms. Our method is thus an effective tool for analyzing biological sequence data, as well as other data types of a discrete combinatorial nature.
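The idea of a one-parameter family that runs from the empirical frequency to a MaxEnt estimate, with the parameter chosen on held-out data, can be illustrated with a deliberately simplified toy. The sketch below is not the field-theoretic method of the abstract: it swaps in a plain linear mixture between the empirical frequency and a first-order (independent-site) MaxEnt distribution, and it uses a single train/validation split rather than full cross validation. The names `ALPHABET`, `LENGTH`, `mixture`, and `choose_t` are hypothetical and chosen only for this illustration. The mixture does share one property stated above: every member of the family reproduces the single-site marginals that constrain the MaxEnt estimate.

```python
import itertools
import math
from collections import Counter

ALPHABET = "ACGU"  # assumption: RNA alphabet, for illustration only
LENGTH = 3         # assumption: toy sequence length, so 4^3 = 64 sequences


def empirical(seqs):
    """Empirical frequency of each observed sequence."""
    n = len(seqs)
    return {s: c / n for s, c in Counter(seqs).items()}


def site_marginals(seqs):
    """Per-position letter frequencies (the constraints of the toy MaxEnt model)."""
    n = len(seqs)
    return [
        {a: sum(s[i] == a for s in seqs) / n for a in ALPHABET}
        for i in range(LENGTH)
    ]


def maxent_first_order(seqs):
    """MaxEnt distribution given single-site marginals: a product of marginals."""
    marg = site_marginals(seqs)
    return {
        "".join(s): math.prod(marg[i][a] for i, a in enumerate(s))
        for s in itertools.product(ALPHABET, repeat=LENGTH)
    }


def mixture(train, t):
    """Toy one-parameter family: t=0 is the empirical frequency, t=1 is MaxEnt."""
    emp = empirical(train)
    me = maxent_first_order(train)
    return {s: (1 - t) * emp.get(s, 0.0) + t * me[s] for s in me}


def heldout_loglik(density, held_out):
    """Mean log-likelihood of held-out sequences; -inf if any has zero probability."""
    total = 0.0
    for s in held_out:
        p = density[s]
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
    return total / len(held_out)


def choose_t(train, held_out, grid):
    """Pick the mixing weight on the grid that best predicts the held-out data."""
    return max(grid, key=lambda t: heldout_loglik(mixture(train, t), held_out))
```

A call such as `choose_t(train, held_out, [0.0, 0.25, 0.5, 0.75, 1.0])` returns the grid value with the highest held-out log-likelihood. In this toy, `t = 0` typically scores negative infinity whenever the held-out set contains a sequence absent from the training set, which is one way to see why the raw empirical frequency is a poor estimate on large sequence spaces.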
Presenters
- Wei-Chia Chen, Cold Spring Harbor Lab
Authors
- Wei-Chia Chen, Cold Spring Harbor Lab
- Juannan Zhou, Cold Spring Harbor Lab
- Jason M Sheltzer, Cold Spring Harbor Lab
- Justin Block Kinney, Cold Spring Harbor Laboratory
- David M McCandlish, Cold Spring Harbor Lab