.center[
.vertical-center[
# The big epidemiological questions

To resolve history, we must measure ![:emph](standing variation) within the MDRO population.

Or at least within our sample
]
]

--

Variation of what?

* How many different plasmids carry a given carbapenemase?
    * <sub><sup>How to define plasmid w/ dynamic gene PA</sup></sub>
    * <sub><sup>Correlated w/ ST or permissive transmission?</sup></sub>

--

* How clonal is each species' core genome?
    * <sub><sup>Find outbreak groups?</sup></sub>
    * <sub><sup>Evidence of recombination?</sup></sub>

--

* Gene presence/absence and structural rearrangements?

---

# Resolving structural variation of carbapenemases with ONT

ONT-sequenced 115 carbapenemase-producing Gram-negative bacteria from the clinic.

--

Illumina for polishing. Required for accurate gene prediction.

--

.center[![:scale 700](/figs/basel/carb/overviewTable.svg)]

.center[Assembled into complete, high-quality genomes in an automated manner (with extensive manual validation)]

---

# Resolving structural variation of carbapenemases with ONT

.center[![:scale 500](/figs/basel/carb/contigSizes.svg)]

--

.center[Allows us to assay the structural diversity of carbapenemase genomic contexts]

---

# How to measure gene synteny of diverse molecules efficiently?

.left-column[.center[![:scale 250](/figs/basel/carb/synteny_cartoon.svg)]

* Use the gene clusters from PanX as our alphabet.
* Align with SeqAn <br> (exposed to Python)<sup>1</sup>]

.footnote[<sup>1</sup>Available on [{GitHub}](https://github.com/nnoll)]

--

.right-column[.center[![:scale 282](/figs/basel/carb/syntenymatrix.svg)]

* Compute all pairwise alignments.
* Matrix of edit distances defines structural clades.]

---

# KPC is found in diverse genomic contexts

.center[![:scale 300](/figs/basel/carb/plasmidTree.svg)]

---

# KPC is found in diverse genomic contexts

.center[ ![:scale 700](/figs/basel/carb/kpcPlasmids.svg) ]

.footnote[ Base plot generated with ![:emph](MayDay): .cite[Herbig, A.
et al., GenomeRing: alignment visualization based on SuperGenome coordinates, Bioinformatics, Jun 2012 ] ]

---

# NDM transposes. OXA-48 stable but promiscuous

.left-column[.center[blaNDM]![:scale 380](/figs/basel/carb/blaNDM-4_1_JQ348841.svg)]

--

.right-column[.center[blaOXA-48]![:scale 380](/figs/basel/carb/blaOXA-48_2_AY236073.svg)]

---

class: center, middle

.title[What about the chromosome?]

---

# How to measure recombination events given only extant sequences?

* Assume an infinite sites model (more realistically $\mu T_{tree} \ll 1$)

--

* Homoplasies are (putatively) caused by past recombinations.

.center[![:scale 600](/figs/basel/carb/homoplasy.svg)]

---

# How to measure recombination events given only extant sequences?

* Assume an infinite sites model (more realistically $\mu T_{tree} \ll 1$)
* Homoplasies are (putatively) caused by past recombinations.
* Infer mutational events using ML estimation on the fixed CG tree

.center[![:scale 500](/figs/basel/carb/ancseq.svg)]

.footnote[Tree was built using RAxML on the concatenation of the CG]

---

# How to measure recombination events given only extant sequences?

* Assume an infinite sites model (more realistically $\mu T_{tree} \ll 1$)
* Homoplasies are (putatively) caused by past recombinations.
* Infer mutational events using ML estimation on the fixed CG tree

.center[![:scale 500](/figs/basel/carb/ancseq_full.svg)]

Compute the local density of homoplasies along the CG.

.footnote[Inference using TreeTime's AncSeq class]

---

# Klebsiella and E. coli have punctuated ancestral recombinations

Averaged results over 5 kb blocks

--

.center[![:scale 700](/figs/basel/carb/klebHomoplasic.svg)]

--

.center[![:scale 700](/figs/basel/carb/ecoliHomoplasic.svg)]

---

# Approach the problem from a different angle: tree-scanning

.left-column[.center[![:scale 350](/figs/basel/carb/treescan.svg)]]

--

.right-column[.left[![:scale 380](/figs/basel/carb/all_collinearBlocks_refined.png)]
.center[(RF metric<sup>1</sup>)]]

.footnote[<sup>1</sup> The SPR metric would be more appropriate, but is much more computationally intensive. ]

* Cluster trees based on distance matrix.

--

* Similar to Gubbins but conditioned on homoplasies

--

* Rebuild trees based upon alignment clusters.

---

# Klebsiella 'well' described by three major tree topologies

.center[![:scale 700](/figs/basel/carb/klebHomoplasic_full.svg)]

.footnote[Partition 3 (green) has been previously observed in .cite[Chen, L. et al. ASMB., May, 2014 ]]

---

# E. coli is heterogeneous within homoplasy-rich regions

.center[![:scale 700](/figs/basel/carb/ecoliHomoplasic_full.svg)]

---

# Gene gain/loss is clock-like. Rearrangements random

.third1[![:scale 250](/figs/basel/carb/panSizes.svg)]

--

.third2[![:scale 300](/figs/basel/carb/pairwise_pa.svg)]

--

.third2[![:scale 170](/figs/basel/carb/inversions.svg)]

--

* Estimate the error of annotation in this step. `\(\sim\)` 20 genes/pair
* Sublinear scaling of distance vs. number of non-shared genes
* No tree-like structure in rearrangements

---

# Takeaways and future directions

* Epidemiologically harder problem than viruses.
    * <sub><sup>Russian doll of variation: SNPs on top of genes which are flanked by transposable elements that sit on communally shared plasmids. </sup></sub>
    * <sub><sup>Requires novel theories + data structures to predict/quantitatively say anything useful.</sup></sub>

--

* Quantitative understanding of the spread of carbapenemases will require deconvolving HGT + clonal growth.
    * <sub><sup>Are some genomes more susceptible to integrating specific transposons? Plasmids? </sup></sub>
    * <sub><sup>Reliably estimate the rate at which each event occurs? Distribution of fitness effect?
</sup></sub>

--

* Long-read sequencing will be critical to quantitatively pinning down the evolution of antibiotic resistance. Need ![:emph](many more) full genomes.

---

# Acknowledgements

.left-column[![:scale 350](/figs/basel/carb/ack.png)]

.right-column[
* My coauthors on the paper
    * <sup><sub>Eric Ulrich</sub></sup>
    * <sup><sub>Daniel Wurthrich</sub></sup>
    * <sup><sub>Vladimira Hinic</sub></sup>
    * <sup><sub>Adrian Egli</sub></sup>
    * <sup><sub>Richard Neher</sub></sup>
* Organizers of this wonderful conference
* You all for listening
]

---

# Three types of expected errors

1. False nucleotides within the final assembly.

--

2. Erroneous short indels ![:emph](1-10 bp)

--

3. Global misassembly errors

--

.center[![:scale 500](/figs/basel/carb/error_rates.svg)<figcaption>(A) Nanopore error rate (B) Illumina error rate</figcaption>]

---

# Quantification of Error Type 1

* Map Illumina reads to final assembly.
* Count the number of false SNPs in each column of the pileup.
* Is the observed error rate in the pileup ![:emph](statistically) consistent?

--

.left-column[![:scale 350](/figs/basel/carb/illumina_error_rates.svg)]

Take \\(n\\) and \\(\phi\\) to be the number of errors and the coverage at a given site

`$$\mathcal{L}(n|\phi)= {\phi \choose n} p^n (1-p)^{\phi-n} $$`

Assume the distribution of coverage is Poisson

`$$\rho(\phi)= \bar{\phi}^\phi e^{-\bar{\phi}} / \phi!$$`

Use Bayes' Theorem

---

# Quantification of Error Type 2

* Long-read sequencing has a known indel problem, even for consensus assemblies.

--

* We used Canu due to unreliable results with Unicycler.

--

* Spurious indels will lead to downstream gene prediction problems due to false premature stop codons - see [{Mick Watson's blog post}](http://www.opiniomics.org/on-stuck-records-and-indel-errors-or-stop-publishing-bad-genomes/)

--

.left-column[
1. Map all annotated proteins to the SwissProt database.
2. Compare lengths of all top hits with sequence identity > 0.75
3. Ask how many are shortened by 10 percent
]

.right-column[
![:scale 305](/figs/basel/carb/uniprot_length_compare.svg)
]

---

# Quantification of Error Type 3

Look for regions of anomalously low Nanopore coverage.

.center[![:scale 500](/figs/basel/carb/uniform_nanopore_coverage.svg)]

--
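
A minimal sketch of such a coverage scan, assuming the per-base Nanopore depth is already available as a plain list. The function name `low_coverage_windows`, the window size, and the threshold fraction are illustrative choices, not the values used in the analysis.

```python
from statistics import median

def low_coverage_windows(depth, window=1000, frac=0.25):
    """Return (start, end) for windows whose mean depth is below frac * median depth.

    `window` and `frac` are hypothetical defaults chosen for illustration.
    """
    cutoff = frac * median(depth)
    flagged = []
    for start in range(0, len(depth), window):
        block = depth[start:start + window]
        if sum(block) / len(block) < cutoff:
            flagged.append((start, start + len(block)))
    return flagged

# Toy example: uniform 50x coverage with a dip to 5x between 3000 and 4000.
depth = [50] * 10_000
depth[3000:4000] = [5] * 1000
print(low_coverage_windows(depth))  # [(3000, 4000)]
```

Comparing against the median rather than the mean keeps the cutoff stable when a few regions have very high or very low depth.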