.center[
.vertical-center[
# Observing bacterial pathogen evolution with long read sequencing

Nicholas Noll

Neher Lab

Biozentrum, University of Basel
]
]

---
# Sequence variation encodes the spread of pathogens

.center[]

.footnote[Images by Trevor Bedford]

---
count: false
# Sequence variation encodes the spread of pathogens

Prerequisites for epidemiological techniques:

* Evolution generates enough variation
    * <sub><sup>Steel bound: an $n$-leaf tree can be inferred from sequences of length $O(\log n)$ if $\mu \sim 0.25$<sup>1</sup></sup></sub>
    * <sub><sup>RNA viruses mutate at $\sim 10^{-5}$ per site per day, i.e. $\sim 1$ SNP per week</sup></sub>

--

* Sequencing samples enough of the population dynamics

--

* Molecular substrate is static, i.e. alignable
    * <sub><sup>Bacteria mutate at $\sim 10^{-8}$ per site per day. How static is the substrate?</sup></sub>

.footnote[<sup>1</sup>.cite[Daskalakis et al. 2009]]

---
# "Understood" regime: successive mutations on a static sequence

.center[]

.center[All downstream analyses require a sequence alignment from which to define polymorphisms, and thus the degrees of freedom under evolution]

---
# We only have models of mutations on a static sequence

.left-column[.middle[]]
.right-column[.middle[]]

.footnote[<sup>1</sup>.cite[Beneficial Mutation-Selection Balance and the Effect of Linkage on Positive Selection.
Michael Desai, Daniel Fisher]]

--

Theoretical understanding of

* scaling of the average rate of mutation accumulation with $\mu, N, s$
* coalescent theory: how dynamics are reflected in the statistics of the underlying tree
* how to extract from data: can align sequences and estimate a tree

--

.center[No such null models exist for bacterial evolution.]

---
# Microbial evolution is different

.middle[.center[]]

--

.center[Evolution of bacterial AMR doesn't fit the mutational competition paradigm]

---
# Bacteria evolve by horizontally sharing genes

.center[]

---
count: false
# Bacteria evolve by horizontally sharing genes

.left-column[.center[.middle[]]]

.middle[.right-column[.center[]]]

---
# Resolving HGT with long reads

.left-column[
Reconstruct history by sequencing

- Illumina reads: high coverage, short reads.
- Too short to bridge repetitive elements
- Fragmented assemblies
- Problem! Most AMR genes are flanked by repetitive/mobile elements

.center[]
]

.footnote[<sup>1</sup>.cite[.url[github.com/rrwick]]]

--

.right-column[
ONT long reads are required to resolve structural diversity

.center[]
]

---
# Global carbapenemase outbreak as case study

.left-column[
* Reserve antibiotics used to treat MDR bacteria.
* First observed in the late 1980s
* Phenotypic resistance is conferred by multiple different genes
    - <sub><sup>Growing public health problem.</sup></sub>
    - <sub><sup>Globally heterogeneous prevalence</sup></sub>
* Fascinating case study into deconvolving spread mediated by horizontal transfer and clonal expansion.
]

.right-column[]

---
# Long-read sequencing of carbapenemase-producing bacteria

.center[]

--

.third1[]

.twoThirdsRight[
110 carbapenemase-producing bacteria in Basel over $\sim 7$ years.

* Hybrid assemblies resolve structural and nucleotide polymorphism.
* Short-read contigs containing AMR genes average 6 genes in length
    * <sub><sup>Not enough diversity to reconstruct history</sup></sub>
* Have to verify assemblies for which no references exist.
]

---
# High-quality genome assemblies

.middle[
.center[]
]

---
# Goal: begin to enumerate structural "mutations"

How do we reconstruct evolutionary history in the horizontal regime from sequencing data?

* Tracking mutations on relevant genes is not enough
    * <sup><sub>Selection over $20$ years. $\sim 1$ kb region</sub></sup>
    * <sup><sub>Handful of mutations</sub></sup>
* Most AMR genes are transferred via conjugative plasmids.
    * <sup><sub>One-to-one correspondence?</sub></sup>
    * <sup><sub>Are plasmids well approximated by a static sequence?</sub></sup>
    * <sup><sub>Correlations to ST?</sub></sup>
* Many AMR genes are embedded within transposable elements.

--

.center[First step must be deciphering the  of each polymorphism-generating event.]

---
# Genes as a coarse-grained unit

Assume most bacterial variation on clinical time-scales occurs in both gene content and order (synteny).

--

Must computationally recognize orthologous gene clusters in our sample.

--

.third1[
.middle[
.center[
Align all ORF pairs w/ DIAMOND
]
]
]

--

.third2[
.middle[
.center[
MCL clustering
]
]
]

--

.third3[
.middle[
.center[
Paralogy splitting
]
]
]

.footnote[.cite[Ding, W. et al. panX: pan-genome analysis and exploration]]

---
# Syntenic alignment $\approx$ structural diversity

.left-column[
.center[]

.center[Hierarchically cluster into "structural clades"]
]

--

.right-column[
.center[]
]

.left-column[
* Syntenic changes resolve evolutionary relationships between plasmids
* Different $bla_{KPC}$ genes are found in the same context
* Plasmids are promiscuously shared across MLST and species
]

---
# Carbapenemases have varying signatures of HGT

.third1[]

.third2[
.center[]
]

.third3[]

--

.block[
* $bla_{KPC}$: plasmid-bound, correlated w/ MLST and clone
* $bla_{NDM}$: high transposition rate,
genome integration
* $bla_{OXA-48}$: high conjugation rate, low transposition rate
]

---
# Problems with this analysis

* Sample size is just large enough to get a qualitative sense of the rates, but not large enough to measure them quantitatively.
* Extreme sensitivity to annotation errors
* Syntenic alignment is not a proportional measure of evolutionary events, e.g. inversions

--

.center[The next section is very much a work in progress! Thoughts and general grumpiness are welcome.]

---
# Scaling up to a global picture

Extend our dataset:

* Perform the same comparison against carbapenemase-carrying plasmids contained in the NCBI pathogen database.
* Compare against a structural outgroup to estimate transposition

--

.center[$bla_{KPC}$]

.center[]

.center[Most global structural "clades" are represented by our Basel sample.]

---
# Formalizing structural diversity as a graph

Generalize away from a fixed linear coordinate system to describe polymorphisms

* Each genome is represented as a closed path through a graph.
* Alignable regions are simply collinear paths.
* Better evolutionary distance measure than synteny alignment score.
* Structural variability of a particular locus = # paths.

--

.middle[.center[]]

---
# Future outlook

Can we start to make theoretical inroads into basic questions regarding polymorphism at the level of molecular architecture?

* How much variation in synteny should one expect given a quickly adapting molecule?
* Can we understand the statistics of the resultant structural trees?
* How do rearrangement dynamics renormalize the statistics of the underlying gene tree?

--

Complementary requirement: we need algorithms to deal with evolution in this limit.

* Multiple "plasmid" alignment in the face of structural rearrangements.
* Need a precise definition of a polymorphic degree of freedom to track.
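---
# Sketch: genomes as closed paths through a gene graph

A minimal, hypothetical sketch of the graph picture, not the method used in this work. It assumes each plasmid has already been reduced to a circular list of gene-cluster labels and ignores strand. The toy labels, the function names `adjacency_edges`, `edge_distance` and `locus_contexts`, and the Jaccard-style distance are all illustrative assumptions.

```python
from collections import Counter


def adjacency_edges(genome):
    """Undirected adjacencies between consecutive gene clusters on a circular genome."""
    n = len(genome)
    return {frozenset((genome[i], genome[(i + 1) % n])) for i in range(n)}


def edge_distance(genome_a, genome_b):
    """Jaccard distance between adjacency sets of two non-empty genomes.

    An assumed stand-in for a graph-based distance: each shared edge is a
    collinear step, each unshared edge a structural breakpoint.
    """
    a, b = adjacency_edges(genome_a), adjacency_edges(genome_b)
    return 1.0 - len(a & b) / len(a | b)


def locus_contexts(genomes, focal, flank=1):
    """Count the distinct paths of `flank` genes on either side of `focal`."""
    contexts = Counter()
    for genome in genomes.values():
        n = len(genome)
        for i, gene in enumerate(genome):
            if gene != focal:
                continue
            # Modular indexing wraps around the circular genome.
            window = tuple(genome[(i + d) % n] for d in range(-flank, flank + 1))
            # A path and its reverse are the same structure when strand is ignored.
            contexts[min(window, window[::-1])] += 1
    return contexts


# Toy data: hypothetical cluster labels; "X" plays the role of the AMR gene.
genomes = {
    "p1": ["A", "B", "X", "C", "D"],
    "p2": ["D", "C", "X", "B", "A"],  # p1 read in the opposite direction
    "p3": ["A", "E", "X", "C", "D"],  # B replaced by E next to X
}

contexts = locus_contexts(genomes, "X")
print(len(contexts))  # number of distinct paths through the locus
print(edge_distance(genomes["p1"], genomes["p3"]))
```

In this toy sample the locus around `X` has two distinct paths, and `p1` and `p2` are at distance zero because they trace the same closed path in opposite directions. A larger `flank` probes longer paths through the locus.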
---
# Acknowledgements

.twoThirdsLeft[]

.block[
My collaborators

* <sup><sub>Eric Ulrich</sub></sup>
* <sup><sub>Daniel Wurthrich</sub></sup>
* <sup><sub>Vladimira Hinic</sub></sup>
* <sup><sub>Adrian Egli</sub></sup>
* <sup><sub>Richard Neher</sub></sup>

You all for listening
]