presentation

.center[
.vertical-center[
# The big epidemiological questions
To resolve evolutionary history, we must measure ![:emph](standing variation) within the MDRO population, or at least within our sample.
]
]

Variation of what?

* How many different plasmids carry a given carbapenemase?
 * How do we define a plasmid when gene presence/absence (PA) is dynamic?
 * Is plasmid carriage correlated with sequence type (ST), or is transmission permissive?

--
* How clonal is each species' core genome? 
 * Find outbreak groups?
 * Evidence of recombination?

--
* Gene presence/absence and structural rearrangements?

---

# Resolving structural variation of carbapenemases with ONT

We sequenced 115 carbapenemase-producing Gram-negative clinical isolates with ONT.

Illumina reads were used for polishing, which is required for accurate gene prediction.

&nbsp;
.center[![:scale 700](/figs/basel/carb/overviewTable.svg)]

.center[Assembled into complete, high-quality genomes in an automated manner (with extensive manual validation)]

---

# Resolving structural variation of carbapenemases with ONT

.center[![:scale 500](/figs/basel/carb/contigSizes.svg)]

.center[Allows us to assay the structural diversity of carbapenemase genomic contexts]

---

# How to measure gene synteny of diverse molecules efficiently?
.left-column[.center[![:scale 250](/figs/basel/carb/synteny_cartoon.svg)]
* Use the gene clusters from panX as our alphabet.
* Align with SeqAn (exposed to Python)<sup>1</sup>]

.footnote[<sup>1</sup> Available on [{GitHub}](https://github.com/nnoll)]

.right-column[.center[![:scale 282](/figs/basel/carb/syntenymatrix.svg)]
* Compute all pairwise alignments. 
* Matrix of edit distance defines structural clades.]
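
A minimal sketch of the idea above, not the SeqAn-based implementation used in the talk: each contig is written as a sequence of gene-cluster IDs, and a plain Levenshtein distance is computed for every pair. The cluster names and contigs are hypothetical, and strand and circular permutation are ignored here.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two sequences of gene-cluster IDs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete x
                            curr[j - 1] + 1,            # insert y
                            prev[j - 1] + (x != y)))    # match or substitute
        prev = curr
    return prev[-1]

def distance_matrix(contigs):
    """Symmetric matrix of pairwise edit distances, keyed by contig name."""
    names = sorted(contigs)
    dist = {(n, n): 0 for n in names}
    for p, q in combinations(names, 2):
        dist[p, q] = dist[q, p] = edit_distance(contigs[p], contigs[q])
    return dist

# Hypothetical toy input: gene order along three plasmids.
contigs = {
    "plasmid_A": ["GC1", "GC2", "GC3", "GC4"],
    "plasmid_B": ["GC1", "GC3", "GC4"],
    "plasmid_C": ["GC5", "GC2", "GC3", "GC6"],
}
dist = distance_matrix(contigs)
```

The resulting matrix can be passed to any standard hierarchical clustering routine to define structural clades.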

---

# KPC is found in diverse genomic contexts
.center[![:scale 300](/figs/basel/carb/plasmidTree.svg)]

---
# KPC is found in diverse genomic contexts

.center[ ![:scale 700](/figs/basel/carb/kpcPlasmids.svg) ]
.footnote[ Base plot generated with ![:emph](MayDay): .cite[Herbig, A. et al., GenomeRing: alignment visualization based on SuperGenome coordinates, Bioinformatics., Jun, 2012 ] ]

---

# NDM transposes. OXA-48 stable but promiscuous

.left-column[.center[blaNDM]![:scale 380](/figs/basel/carb/blaNDM-4_1_JQ348841.svg)]

.right-column[.center[blaOXA-48]![:scale 380](/figs/basel/carb/blaOXA-48_2_AY236073.svg)]

---

class: center, middle
.title[What about the chromosome?]

---

# How to measure recombination events given only extant sequences?

* Assume an infinite sites model (more realistically $\mu T_{tree} \ll 1$)

--
* Homoplasies are (putatively) caused by past recombinations.

.center[![:scale 600](/figs/basel/carb/homoplasy.svg)]

---

# How to measure recombination events given only extant sequences?

* Assume an infinite sites model (more realistically $\mu T_{tree} \ll 1$)
* Homoplasies are (putatively) caused by past recombinations.
* Infer mutational events using maximum-likelihood (ML) estimation on the fixed core-genome (CG) tree

.center[![:scale 500](/figs/basel/carb/ancseq.svg)]

.footnote[Tree was built using RAxML on the concatenated core-genome (CG) alignment]
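
A minimal sketch of homoplasy detection on a fixed tree. It swaps in Fitch parsimony for the ML ancestral reconstruction used in the talk, purely to keep the example short. The tree (nested tuples, assumed binary) and the alignment are hypothetical. A site is flagged when it needs more changes on the tree than the number of distinct states minus one.

```python
def fitch_changes(tree, states):
    """Return (candidate state set, minimum number of changes) for one site."""
    if isinstance(tree, str):                  # leaf: observed state
        return {states[tree]}, 0
    left, right = tree
    left_set, left_n = fitch_changes(left, states)
    right_set, right_n = fitch_changes(right, states)
    common = left_set & right_set
    if common:                                 # children agree: no new change
        return common, left_n + right_n
    return left_set | right_set, left_n + right_n + 1

def homoplasic_sites(tree, alignment):
    """Indices of alignment columns that require repeated mutation."""
    length = len(next(iter(alignment.values())))
    sites = []
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        _, changes = fitch_changes(tree, column)
        if changes > len(set(column.values())) - 1:
            sites.append(i)
    return sites

# Hypothetical toy data: site 2 conflicts with the tree, site 1 supports it.
tree = (("A", "B"), ("C", "D"))
alignment = {"A": "ACGT", "B": "ACTT", "C": "AGGT", "D": "AGTT"}
sites = homoplasic_sites(tree, alignment)
```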

---

# How to measure recombination events given only extant sequences?

.center[![:scale 500](/figs/basel/carb/ancseq_full.svg)]

Compute the local density of homoplasies by performing this inference across the core genome (CG).

.footnote[Inference using TreeTime's AncSeq class]

---

# Klebsiella and E. coli have punctuated ancestral recombinations
Results averaged over 5 kb blocks

&nbsp;
.center[![:scale 700](/figs/basel/carb/klebHomoplasic.svg)]

.center[![:scale 700](/figs/basel/carb/ecoliHomoplasic.svg)]
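
The block averaging can be sketched as follows, assuming homoplasic sites are given as 0-based genome coordinates. The function name, positions, and genome length are hypothetical.

```python
def block_density(positions, genome_length, block_size=5000):
    """Homoplasies per base in consecutive, non-overlapping blocks."""
    n_blocks = -(-genome_length // block_size)     # ceiling division
    counts = [0] * n_blocks
    for pos in positions:
        counts[pos // block_size] += 1
    densities = []
    for block, count in enumerate(counts):
        start = block * block_size
        width = min(block_size, genome_length - start)  # last block may be short
        densities.append(count / width)
    return densities

# Hypothetical positions of homoplasic sites on a 15 kb stretch.
positions = [10, 4999, 5000, 12000, 12001, 12500]
density = block_density(positions, genome_length=15000)
```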

---

# Approaching the problem from a different angle: tree scanning
.left-column[.center[![:scale 350](/figs/basel/carb/treescan.svg)]]

.right-column[.left[![:scale 380](/figs/basel/carb/all_collinearBlocks_refined.png)]
.center[(RF metric<sup>1</sup>)]]
.footnote[<sup>1</sup> The SPR metric would be more appropriate, but it is much more computationally intensive.]

* Cluster trees based on distance matrix.

--
* Similar to Gubbins but conditioned on homoplasies

--
* Rebuild trees based upon alignment clusters.
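
A minimal sketch of the distance step, assuming one tree per genomic window, all on the same leaf set, written as nested tuples. The Robinson-Foulds (RF) distance is the number of non-trivial splits found in exactly one of the two trees. The trees and names below are hypothetical.

```python
def leaf_set(tree):
    """All leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions, each stored as the side without the anchor leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found

def rf_distance(t1, t2):
    """Unrooted Robinson-Foulds distance between two trees."""
    return len(splits(t1) ^ splits(t2))

# Hypothetical trees from three genomic windows.
trees = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "C"), ("B", ("D", "E"))),
    (("A", "B"), (("C", "D"), "E")),
]
rf = [[rf_distance(s, t) for t in trees] for s in trees]
```

Windows whose trees sit close together in `rf` can then be grouped, and a single tree rebuilt from the pooled alignment of each group.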

---

# Klebsiella is 'well' described by three major tree topologies
&nbsp;
.center[![:scale 700](/figs/basel/carb/klebHomoplasic_full.svg)]
.footnote[Partition 3 (green) has been previously observed in .cite[Chen, L. et al. ASMB., May, 2014 ]]

---

# E. coli is heterogeneous within homoplasy-rich regions
&nbsp;
.center[![:scale 700](/figs/basel/carb/ecoliHomoplasic_full.svg)]

---

# Gene gain/loss is clock-like. Rearrangements are random

.third1[![:scale 250](/figs/basel/carb/panSizes.svg)]

--
.third2[![:scale 300](/figs/basel/carb/pairwise_pa.svg)]

--
.third2[![:scale 170](/figs/basel/carb/inversions.svg)]

--
&nbsp;      
* Estimate the annotation error in this step: `$\sim$` 20 genes/pair
* Sublinear scaling of distance vs. number of non-shared genes
* No tree-like structure in rearrangements

---

# Takeaways and future directions

* Epidemiologically harder problem than viruses. 
 * Russian doll of variation: SNPs on top of genes which are flanked by transposable elements that sit on communally shared plasmids. 
 * Requires novel theory and data structures to make useful quantitative predictions.

* Quantitative understanding of the spread of carbapenemases will require deconvolving HGT and clonal growth.
 * Are some genomes more susceptible to integrating specific transposons? Plasmids?
 * Can we reliably estimate the rate at which each event occurs? The distribution of fitness effects?

* Long-read sequencing will be critical to quantitatively pinning down the evolution of antibiotic resistance. We need ![:emph](many more) full genomes.

---

# Acknowledgements

.left-column[![:scale 350](/figs/basel/carb/ack.png)]

.right-column[
* My coauthors on the paper
 * Eric Urich
 * Daniel Wüthrich
 * Vladimira Hinic
 * Adrian Egli
 * Richard Neher

* Organizers of this wonderful conference

* You all for listening 
]

---

# Three types of expected errors 
1. Incorrect nucleotides within the final assembly.
--

2. Erroneous short indels (![:emph](1-10 bp))
--

3. Global misassembly errors
--
&nbsp; 
&nbsp; 
.center[![:scale 500](/figs/basel/carb/error_rates.svg)<figcaption>(A) Nanopore error rate (B) Illumina error rate</figcaption>]

---

# Quantification of Error Type 1
* Map Illumina reads to the final assembly.
* Count the number of mismatches to the assembly in each column of the pileup.
* Is the observed error rate in the pileup ![:emph](statistically) consistent with sequencing error alone?
--

.left-column[![:scale 350](/figs/basel/carb/illumina_error_rates.svg)]

Let \$n\$ be the number of errors and \$\phi\$ the coverage at a given site, with per-read error probability \$p\$
`$$\mathcal{L}(n|\phi)= {\phi \choose n} p^n (1-p)^{\phi-n} $$`
Assume the distribution of coverage is Poisson with mean \$\bar{\phi}\$
`$$\rho(\phi)= \bar{\phi}^\phi e^{-\bar{\phi}} / \phi!$$`
Use Bayes' Theorem
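
The slide does not spell this step out, so the following is one possible reading, stated as an assumption: multiply the two distributions to get the joint probability of errors and coverage, then sum over coverage to get the expected distribution of error counts per site. Analytically the sum collapses to a Poisson with mean \$p\bar{\phi}\$, and the sketch below checks that numerically. The values of `p` and `mean_cov` are hypothetical.

```python
from math import comb, exp, lgamma, log

def binomial(n, phi, p):
    """L(n | phi): probability of n errors among phi reads."""
    return comb(phi, n) * p**n * (1 - p)**(phi - n)

def poisson(k, mean):
    """Poisson probability, computed in log space to avoid overflow."""
    return exp(k * log(mean) - mean - lgamma(k + 1))

def marginal_errors(n, p, mean_cov, max_cov=500):
    """P(n) = sum over phi of L(n | phi) * rho(phi), truncated at max_cov."""
    return sum(binomial(n, phi, p) * poisson(phi, mean_cov)
               for phi in range(n, max_cov + 1))

# Hypothetical per-read error rate and mean coverage.
p, mean_cov = 0.01, 50.0
for n in range(4):
    # The two printed probabilities should agree for every n.
    print(n, marginal_errors(n, p, mean_cov), poisson(n, p * mean_cov))
```

Sites whose observed error counts fall far in the tail of this distribution are candidates for true assembly errors rather than sequencing noise.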

---

# Quantification of Error Type 2
* Long-read sequencing has a known indel problem, even for consensus assemblies.

--
* We used Canu due to unreliable results with Unicycler.

--
* Spurious indels will lead to downstream gene prediction problems due to false premature stop codons - see [{Mick Watson's blog post}](http://www.opiniomics.org/on-stuck-records-and-indel-errors-or-stop-publishing-bad-genomes/)  
&nbsp;

--
.left-column[  1. Map all annotated proteins to the SwissProt database.  
2. Compare the lengths of all top hits with sequence identity > 0.75.  
3. Ask how many are shortened by at least 10 percent. ]
.right-column[ ![:scale 305](/figs/basel/carb/uniprot_length_compare.svg) ]
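
The three steps can be sketched as below, assuming the top hit per protein has already been parsed into a record with query length, subject length, and fractional identity. The field names and example records are hypothetical, and "shortened by 10 percent" is read here as the predicted protein being at most 90% of the length of its SwissProt hit.

```python
def shortened_count(hits, min_identity=0.75, max_shortening=0.10):
    """Return (number of truncated proteins, number of hits passing the identity filter)."""
    kept = [h for h in hits if h["identity"] > min_identity]
    short = [h for h in kept
             if h["query_len"] <= (1 - max_shortening) * h["subject_len"]]
    return len(short), len(kept)

# Hypothetical top hits, one per annotated protein.
hits = [
    {"query_len": 300, "subject_len": 300, "identity": 0.98},
    {"query_len": 150, "subject_len": 300, "identity": 0.95},  # likely truncated
    {"query_len": 280, "subject_len": 300, "identity": 0.80},
    {"query_len": 100, "subject_len": 400, "identity": 0.40},  # fails identity filter
]
n_short, n_kept = shortened_count(hits)
```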

---

# Quantification of Error Type 3
Look for regions of anomalously low Nanopore coverage.
.center[![:scale 500](/figs/basel/carb/uniform_nanopore_coverage.svg)]
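
A minimal sketch of one way to flag such regions, assuming per-base depth is available as a list. The window size and the cutoff (a fraction of the median window depth) are hypothetical choices, not the values used in the talk.

```python
from statistics import median

def low_coverage_regions(depth, window=1000, threshold=0.25):
    """Return (start, end) of windows whose mean depth is far below the median."""
    means = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    cutoff = threshold * median(means)
    return [(w * window, min((w + 1) * window, len(depth)))
            for w, mean in enumerate(means) if mean < cutoff]

# Hypothetical depth profile with a 1 kb dip.
depth = [60] * 3000 + [5] * 1000 + [55] * 2000
regions = low_coverage_regions(depth)
```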