Statistical Inference of DNA Translocation using Parallel Expectation Maximization

ORAL

Abstract

DNA translocation through a nanopore is an attractive candidate for a next-generation DNA sequencing platform; however, the stochastic motion of the molecules within the pore, which allows both forward and backward movement, prevents easy inference of the true sequence from observed data. We model diffusion of an input DNA sequence through a nanopore as a biased random walk with noise and describe an algorithm for efficient statistical reconstruction of the input sequence from data consisting of a set of time-series traces. The data are modeled as a hidden Markov model, and parallel expectation maximization is used to learn the most probable input sequence generating the observed traces. Bounds on inference accuracy are analyzed as a function of model parameters, including forward bias, error rate, and the number of traces. The number of traces is shown to have the strongest influence on algorithm performance, allowing for high inference accuracy even in extremely noisy environments. Incorrectly identified state transitions account for the majority of inference errors, and we introduce entropy-based metaheuristics for identifying and eliminating these errors. Inference is robust, fast, and scales to input sequences on the order of several kilobases.
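
The generative side of the model described above (a biased random walk with noise) can be illustrated with a minimal sketch. This covers only how a single trace might be simulated; it is not the authors' inference algorithm. The function name `simulate_trace`, the parameter names `p_forward` and `error_rate`, the one-emission-per-step convention, and the reflecting boundary at the start of the sequence are all assumptions made for illustration.

```python
import random


def simulate_trace(sequence, p_forward=0.7, error_rate=0.1,
                   alphabet="ACGT", rng=None):
    """Simulate one noisy trace of `sequence` under a biased random walk.

    Assumed conventions (not taken from the abstract):
      - the walker starts at position 0 and emits one symbol per time step;
      - it steps forward with probability `p_forward`, otherwise backward,
        and cannot move before position 0;
      - each emitted base is replaced by a different base with
        probability `error_rate`;
      - the trace ends once the walker steps past the last position.
    """
    rng = rng or random.Random()
    pos = 0
    trace = []
    while pos < len(sequence):
        base = sequence[pos]
        # Emission noise: substitute a different symbol from the alphabet.
        if rng.random() < error_rate:
            base = rng.choice([b for b in alphabet if b != base])
        trace.append(base)
        # Biased step: forward with probability p_forward, else backward.
        if rng.random() < p_forward:
            pos += 1
        else:
            pos = max(pos - 1, 0)
    return trace


# Several independent traces of the same input sequence, as an
# inference procedure would receive them.
rng = random.Random(0)
traces = [simulate_trace("ACGTTGCA", rng=rng) for _ in range(5)]
```

With `p_forward=1.0` and `error_rate=0.0` the trace reproduces the input exactly; lowering the forward bias lengthens the traces through backtracking, and raising the error rate corrupts individual readings.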

Authors

  • Kevin Emmett

    Columbia University

  • Jacob Rosenstein

    Columbia University

  • David Pfau

    Columbia University

  • Akiva Bamberger

    Columbia University

  • Kenneth Shepard

    Columbia University

  • Chris Wiggins

    Columbia University