.center[
.vertical-center[
# The big epidemiological questions

To resolve history, we must measure variation within the MDRO population, or at least within our sample.
]
]

--

Variation of what?

* How many different plasmids carry a given carbapenemase?
  * <sub><sup>How to define plasmid w/ dynamic gene PA</sup></sub>
  * <sub><sup>Correlated w/ ST or permissive transmission?</sup></sub>

--

* How clonal is each species' core genome?
  * <sub><sup>Find outbreak groups?</sup></sub>
  * <sub><sup>Evidence of recombination?</sup></sub>

--

* Gene presence/absence and structural rearrangements?

---

# Resolving structural variation of carbapenemases with ONT

ONT sequenced 115 carbapenemase-producing Gram-negative bacteria from the clinic.

--

Illumina for polishing. Required for accurate gene prediction.

--

.center[]

.center[Assembled into complete, high-quality genomes in an automated manner (with extensive manual validation)]

---

# Resolving structural variation of carbapenemases with ONT

.center[]

--

.center[Allows us to assay the structural diversity of carbapenemase genomic contexts]

---

# How to measure gene synteny of diverse molecules efficiently?

.left-column[
.center[]

* Use the gene clusters from PanX as our alphabet.
* Align with SeqAn <br> (exposed to Python)<sup>1</sup>
]

.footnote[<sup>1</sup>Available on [{GitHub}](https://github.com/nnoll)]

--

.right-column[
.center[]

* Compute all pairwise alignments.
* Matrix of edit distances defines structural clades.
]

---

# KPC is found in diverse genomic contexts

.center[]

---

# KPC is found in diverse genomic contexts

.center[  ]

.footnote[Base plot generated with: .cite[Herbig, A. et al., GenomeRing: alignment visualization based on SuperGenome coordinates, Bioinformatics, Jun 2012]]

---

# NDM transposes. OXA-48 stable but promiscuous

.left-column[.center[blaNDM]]

--

.right-column[.center[blaOXA-48]]

---
class: center, middle

.title[What about the chromosome?]

---

# How to measure recombination events given only extant sequences?
* Assume an infinite sites model (more realistically $\mu T_{tree} \ll 1$)

--

* Homoplasies are (putatively) caused by past recombinations.

.center[]

---

# How to measure recombination events given only extant sequences?

* Assume an infinite sites model (more realistically $\mu T_{tree} \ll 1$)
* Homoplasies are (putatively) caused by past recombinations.
* Infer mutational events using ML estimation on the fixed CG tree

.center[]

.footnote[Tree was built using RAxML on concat. of CG]

---

# How to measure recombination events given only extant sequences?

* Assume an infinite sites model (more realistically $\mu T_{tree} \ll 1$)
* Homoplasies are (putatively) caused by past recombinations.
* Infer mutational events using ML estimation on the fixed CG tree

.center[]

Compute the local density of homoplasies by performing for CG.

.footnote[Inference using TreeTime's AncSeq class]

---

# Klebsiella and E. coli have punctuated ancestral recombinations

Averaged results over 5 kb blocks

--

.center[]

--

.center[]

---

# Approach the problem from a different angle. Tree-scanning

.left-column[.center[]]

--

.right-column[.left[] .center[(RF metric<sup>1</sup>)]]

.footnote[<sup>1</sup> SPR metric would be more appropriate, but is much more computationally intensive.]

* Cluster trees based on distance matrix.

--

* Similar to Gubbins but conditioned on homoplasies

--

* Rebuild trees based upon alignment clusters.

---

# Klebsiella 'well' described by three major tree topologies

.center[]

.footnote[Partition 3 (green) has been previously observed in .cite[Chen, L. et al. ASMB., May, 2014 ]]

---

# E. coli heterogeneous within homoplasy-rich regions.

.center[]

---

# Gene gain/loss is clock-like. Rearrangements random

.third1[]

--

.third2[]

--

.third2[]

--

* Estimate the error of annotation in this step.
`\(\sim\)` 20 genes/pair
* Sublinear scaling of distance vs. number of non-shared genes
* No tree-like structure in rearrangements

---

# Takeaways and future directions

* Epidemiologically harder problem than viruses.
  * <sub><sup>Russian doll of variation: SNPs on top of genes which are flanked by transposable elements that sit on communally shared plasmids.</sup></sub>
  * <sub><sup>Requires novel theories + data structures to predict/quantitatively say anything useful.</sup></sub>

--

* Quantitative understanding of the spread of carbapenemases will require deconvolving HGT + clonal growth.
  * <sub><sup>Are there more genomes susceptible to integrating specific transposons? Plasmids?</sup></sub>
  * <sub><sup>Reliably estimate the rate at which each event occurs? Distribution of fitness effect?</sup></sub>

--

* Long-read sequencing will be critical to quantitatively pinning down the evolution of antibiotic resistance. Need a lot of full genomes.

---

# Acknowledgements

.left-column[]

.right-column[
* My coauthors on the paper
  * <sup><sub>Eric Ulrich</sub></sup>
  * <sup><sub>Daniel Wurthrich</sub></sup>
  * <sup><sub>Vladimira Hinic</sub></sup>
  * <sup><sub>Adrian Egli</sub></sup>
  * <sup><sub>Richard Neher</sub></sup>
* Organizers of this wonderful conference
* You all for listening
]

---

# Three types of expected errors

1. False nucleotides within the final assembly.

--

2. Erroneous short indels

--

3. Global misassembly errors

--

.center[<figcaption>(A) Nanopore error rate (B) Illumina error rate</figcaption>]

---

# Quantification of Error Type 1

* Map Illumina reads to final assembly.
* Count the number of false SNPs in each column of pileup.
* Is observed error rate in pileup  consistent?
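
The counting step in the bullets above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pipeline used for these assemblies: the `reference` string, the gapless `(start, sequence)` read alignments and the helper names `pileup_error_counts` and `pooled_error_rate` are all hypothetical, and indels are ignored entirely.

```python
def pileup_error_counts(reference, reads):
    """Return per-column (coverage, errors) lists for a toy pileup.

    `reads` is an iterable of (start, sequence) pairs; each read is assumed
    to align without gaps starting at reference position `start`.
    """
    coverage = [0] * len(reference)
    errors = [0] * len(reference)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos < 0 or pos >= len(reference):
                continue  # ignore bases hanging off either end
            coverage[pos] += 1
            if base != reference[pos]:
                errors[pos] += 1  # read disagrees with the assembly here
    return coverage, errors


def pooled_error_rate(coverage, errors):
    """Fraction of all aligned bases that disagree with the assembly."""
    total = sum(coverage)
    return sum(errors) / total if total else 0.0


# Hypothetical data: three short reads against an 8 bp "assembly".
reference = "ACGTACGT"
reads = [(0, "ACGTAC"), (2, "GTACGT"), (1, "CGAACG")]
coverage, errors = pileup_error_counts(reference, reads)
rate = pooled_error_rate(coverage, errors)
```

In practice the per-column counts would come from a real pileup of the mapped Illumina reads; the per-site pairs of errors and coverage are the quantities the consistency check operates on.
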
--

.left-column[]

Take \\(n\\) and \\(\phi\\) to be the number of errors and the coverage of a given site

`$$\mathcal{L}(n|\phi)= {\phi \choose n} p^n (1-p)^{\phi-n} $$`

Assume distribution of coverage is Poisson

`$$\rho(\phi)= \bar{\phi}^\phi e^{-\bar{\phi}} / \phi!$$`

Use Bayes' Theorem

---

# Quantification of Error Type 2

* Long-read sequencing has a known indel problem, even for consensus assemblies.

--

* We used Canu due to unreliable results with Unicycler.

--

* Spurious indels will lead to downstream gene prediction problems due to false premature stop codons - see [{Mick Watson's blog post}](http://www.opiniomics.org/on-stuck-records-and-indel-errors-or-stop-publishing-bad-genomes/)

--

.left-column[
1. Map all annotated proteins to the SwissProt database.
2. Compare lengths of all top hits with percent identity > 0.75
3. Ask how many are shortened by 10 percent
]

.right-column[  ]

---

# Quantification of Error Type 3

Look for regions of anomalously low Nanopore coverage.

.center[]

--