Optimization and historical contingency in protein sequences

ORAL · Invited

Abstract

Protein sequences are shaped both by functional optimization and by evolutionary history, i.e. phylogeny. A multiple sequence alignment of homologous proteins contains sequences that evolved from the same ancestral sequence and have similar structure and function. In such an alignment, correlations in amino-acid usage at different sites can arise from coevolution driven by structural and functional constraints, but also from historical contingency.
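
As an illustration of what a correlation in amino-acid usage between two sites looks like, the minimal sketch below computes the mutual information between two columns of a toy alignment. The alignment `toy_msa` and the function names are hypothetical and chosen only for this example; mutual information is one simple local statistic, not necessarily the one used in the cited works. Note that such a score cannot by itself tell whether a correlation comes from coevolution or from shared ancestry.

```python
import math
from collections import Counter


def column(msa, i):
    """Return the residues found at site i across all aligned sequences."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in nats) between sites i and j of an alignment.

    Uses plain empirical frequencies, with no pseudocounts and no
    sequence reweighting (both are simplifying assumptions).
    """
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    f_i = Counter(col_i)                 # single-site counts at site i
    f_j = Counter(col_j)                 # single-site counts at site j
    f_ij = Counter(zip(col_i, col_j))    # pair counts at sites (i, j)
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


# Hypothetical toy alignment: sites 0 and 1 covary perfectly,
# site 2 is fully conserved, site 3 varies independently of site 0.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD"]

mi_01 = mutual_information(toy_msa, 0, 1)  # covarying pair
mi_02 = mutual_information(toy_msa, 0, 2)  # conserved site carries no signal
mi_03 = mutual_information(toy_msa, 0, 3)  # independent pair
```

In this toy case the covarying pair reaches the maximum possible value for two equiprobable states, while the conserved and independent pairs score zero.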

Correlations arising from phylogeny often confound the coevolution signal that comes from functional or structural optimization, impairing the inference of structural contacts from sequences. However, inferred Potts models are more robust to these effects than local statistics, which may explain their success [1]. Dedicated corrections can further increase this robustness [2]. Moreover, phylogenetic correlations can in fact provide useful information for some inference tasks, in particular for inferring interaction partners from sequences among the paralogs of two protein families. In this case, signal from phylogeny and signal from constraints combine constructively [3], and explicitly exploiting both further improves inference performance [4].
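
For readers unfamiliar with the term, a Potts model in this context is usually written as a pairwise maximum-entropy distribution over aligned sequences. The form below is the standard textbook one; the symbols ($L$ for alignment length, $h_i$ for single-site fields, $J_{ij}$ for pairwise couplings, $Z$ for the normalization) are conventional notation assumed here, not taken from the abstract.

```latex
P(a_1, \dots, a_L) = \frac{1}{Z} \exp\!\left( \sum_{i=1}^{L} h_i(a_i) + \sum_{1 \le i < j \le L} J_{ij}(a_i, a_j) \right)
```

Because the couplings $J_{ij}$ are inferred jointly for all pairs of sites, they describe the whole alignment at once, whereas a local statistic scores each pair of sites in isolation.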

Protein language models have recently been applied to sequence data, greatly advancing structure, function and mutational effect prediction. Language models trained on multiple sequence alignments capture coevolution and structural contacts, but also phylogenetic relationships [5]. They are able to disentangle signal from structural constraints and from phylogeny more efficiently than Potts models [5], and they have promising generative properties [6].

Publication:

[1] Dietler N, Lupo U, Bitbol A-F (2022) "Impact of phylogeny on structural contact inference from protein sequence data", https://arxiv.org/abs/2209.13045

[2] Colavin A, Atolia E, Bitbol A-F, Huang KC (2022) "Extracting phylogenetic dimensions of coevolution reveals hidden functional signals", Scientific Reports 12(1):820

[3] Gerardos A, Dietler N, Bitbol A-F (2022) "Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences", PLoS Computational Biology 18(5): e1010147

[4] Gandarilla-Perez CA, Pinilla S, Bitbol A-F, Weigt M (2022) "Combining phylogeny and coevolution improves the inference of interaction partners among paralogous proteins", https://arxiv.org/abs/2208.11626

[5] Lupo U, Sgarbossa D, Bitbol A-F (2022) "Protein language models trained on multiple sequence alignments learn phylogenetic relationships", https://arxiv.org/abs/2203.15465

[6] Sgarbossa D, Lupo U, Bitbol A-F (2022) "Generative power of a protein language model trained on multiple sequence alignments", https://arxiv.org/abs/2204.07110

Presenters

  • Anne-Florence Bitbol

    EPFL, Ecole Polytechnique Federale de Lausanne

Authors

  • Anne-Florence Bitbol

    EPFL, Ecole Polytechnique Federale de Lausanne