presentation

.center[
.vertical-center[
# Observing bacterial pathogen evolution with long read sequencing

Nicholas Noll

Neher Lab

Biozentrum, University of Basel
]
]

---

# Sequence variation encodes the spread of pathogens
.center[![:scale 775](/figs/basel/ext/infection_tree_1.png)]
.footnote[Images by Trevor Bedford]

---

count: false
# Sequence variation encodes the spread of pathogens
.center[![:scale 775](/figs/basel/ext/infection_tree_2.png)]
.footnote[Images by Trevor Bedford]

---

count: false
# Sequence variation encodes the spread of pathogens
.center[![:scale 775](/figs/basel/ext/infection_tree_3b.png)]
.footnote[Images by Trevor Bedford]

---

count: false
# Sequence variation encodes the spread of pathogens
.center[![:scale 775](/figs/basel/ext/infection_tree_4b.png)]
.footnote[Images by Trevor Bedford]

---

count: false
# Sequence variation encodes the spread of pathogens
Prerequisites for epidemiological techniques:
* Evolution generates enough variation

* Steel bound: an $n$-leaf tree can be inferred from sequences of length $O(\log n)$ if $\mu \sim 0.25$ $^1$
 * RNA viruses mutate at $\sim 10^{-5}$ per site per day: $\sim$ 1 SNP per week

* Sequencing samples enough of the population dynamics

* Molecular substrate is static - i.e. alignable
 * Bacteria mutate at $\sim 10^{-8}$ per site per day. How static is the substrate?
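
A rough check of the two rates above, as a sketch only. The genome lengths (10 kb RNA virus, 5 Mb bacterial chromosome, 100 kb plasmid) are assumed round numbers, not values from this talk.

```python
def expected_snps(rate_per_site_per_day, genome_length, days):
    """Expected number of substitutions along a single lineage."""
    return rate_per_site_per_day * genome_length * days

# Rate from the slide; genome length of ~10 kb is an assumption.
rna_per_week = expected_snps(1e-5, 10_000, 7)

# Rate from the slide; chromosome length of ~5 Mb is an assumption.
bacteria_per_week = expected_snps(1e-8, 5_000_000, 7)

# Same bacterial rate on an assumed ~100 kb plasmid, over one year.
plasmid_per_year = expected_snps(1e-8, 100_000, 365)

print(rna_per_week, bacteria_per_week, plasmid_per_year)
```

Under these assumptions a plasmid-sized region picks up less than one point mutation per year, which is why point mutations alone may carry little signal on clinical time scales.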

.footnote[1.cite[Daskalakis et al. 2009]]

---

# "Understood" regime: successive mutations on a static sequence

.center[![:scale 700](/figs/basel/ext/align2.png)]

.center[All downstream analyses require a sequence alignment, which defines the polymorphisms and thus the degrees of freedom under evolution]

---

# Existing models only describe mutations on a static sequence

.left-column[.middle[![:scale 500](/figs/basel/ext/seq_evolve.svg)]]
.right-column[.middle[![:scale 500](/figs/basel/ext/muller_plots.jpg)]]

.footnote[1.cite[Beneficial Mutation-Selection Balance and the Effect of Linkage on Positive Selection. Michael Desai, Daniel Fisher]]

&nbsp;

&nbsp;
--

Theoretical understanding of
* scaling of the average rate of mutation accumulation with $\mu, N, s$
* coalescent theory: how dynamics are reflected in statistics of underlying tree
* how to extract this from data: ![:emph](align) sequences and estimate the tree

.center[No such null models of bacterial evolution.]

---

# Microbial evolution is different

&nbsp;
&nbsp;

.middle[.center[![:scale 450](/figs/basel/ext/HGT.png)]]

.center[Evolution of bacterial AMR doesn't fit mutational competition paradigm]

---

# Bacteria evolve by horizontally sharing genes

.center[![:scale 1000](/figs/basel/ext/panX_association.png)]

---

count: false
# Bacteria evolve by horizontally sharing genes

.left-column[
.center[
.middle[
![:scale 325](/figs/basel/carb/kleb_tree.png)
]]]

.middle[
.right-column[
&nbsp;

.center[
![:scale 450](/figs/basel/carb/pa_vs_divergence.png)
]
]
]

---

# Resolving HGT with long reads

.left-column[
Reconstruct history by sequencing
- Illumina reads: high coverage, short reads.
- Too short to bridge repetitive elements
- Fragmented assemblies
- Problem: most AMR genes are flanked by repetitive/mobile elements
.center[
![:scale 350](/figs/basel/ext/bad_assembly_graph.png)
]
]

.footnote[1.cite[.url[github.com/rrwick]]]

--
.right-column[
ONT long reads required to resolve structural diversity
.center[
![:scale 300](/figs/basel/carb/minIon.jpg)
![:scale 400](/figs/basel/carb/canu_to_spades.png)
]
]

---

# Global carbapenemase outbreak as a case study

.left-column[
* Reserve antibiotics used to treat MDR bacteria.
* First observed in the late 1980s
* Phenotypic resistance is conferred by multiple different genes
 - Growing public health problem.
 - Globally heterogeneous prevalence
* Fascinating case study in deconvolving spread mediated by horizontal transfer from spread by clonal expansion.
]
.right-column[![:scale 500](/figs/basel/ext/carb_prevalence_eu.png)]

---

# Long-read sequencing of carbapenemase-producing bacteria
.center[![:scale 900](/figs/basel/carb/overview_table.svg)]

.third1[
&nbsp;
![:scale 350](/figs/basel/carb/contigSizes.svg)
]

.twoThirdsRight[
110 carbapenemase-producing bacteria collected in Basel over $\sim$ 7 years.
* Hybrid assemblies resolve structural and nucleotide polymorphism.
* Short read contigs containing AMR genes avg. 6 genes long
 * Not enough diversity to reconstruct history 
* Must verify assemblies for which no reference exists.
]

---

# High-quality genome assemblies

&nbsp;

.middle[
.center[
![:scale 1000](/figs/basel/carb/errorCharacterization.png)
]
]

---

# Goal: begin to enumerate structural "mutations"

How do we reconstruct evolutionary history in the horizontal regime from sequencing data?
* Tracking mutations on relevant genes not enough
 * Selection over $20$ years on a $\sim 1$ kb region.
 * Only a handful of mutations.
* Most AMR genes are transferred via conjugative plasmids.
 * One-to-one correspondence? 
 * Are plasmids well approximated by static sequence? 
 * Correlations to ST? 
* Many AMR genes are embedded within transposable elements.

&nbsp;
.center[First step must be deciphering the ![:emph](relative rates) of each polymorphism-generating event.]

---

# Genes as a coarse-grained unit
Assume most bacterial variation on clinical time-scales occurs in both gene content and order (synteny).

Must computationally recognize orthologous gene clusters in our sample.

.third1[
&nbsp;
.middle[
.center[
![:scale 360](/figs/basel/ext/aaalign.jpg)

Align all ORF pairs w/ DIAMOND
]
]
]
--

.third2[
&nbsp;
.middle[
.center[
![:scale 250](/figs/basel/ext/mcl.jpeg)

MCL clustering
]
]
]
--

.third3[
&nbsp;
.middle[
.center[
![:scale 305](/figs/basel/ext/paralogy.png)

Paralogy splitting
]
]
]

.footnote[.cite[Ding, W. et al. panX: pan-genome analysis and exploration]]
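
The three steps above can be sketched on toy data. This is not the panX implementation: connected components stand in for MCL, the `genome|gene` identifiers and the `min_identity` threshold are hypothetical, and the hit list plays the role of parsed DIAMOND output.

```python
from collections import defaultdict

def cluster_genes(hits, min_identity=0.7):
    """Group genes linked by hits at or above min_identity (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, identity in hits:
        ra, rb = find(a), find(b)  # registers both genes even when not merged
        if identity >= min_identity and ra != rb:
            parent[ra] = rb

    clusters = defaultdict(set)
    for gene in list(parent):
        clusters[find(gene)].add(gene)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])

def has_paralogs(cluster):
    """True if one genome contributes more than one gene to the cluster."""
    genomes = [gene.split("|")[0] for gene in cluster]
    return len(genomes) != len(set(genomes))

# Toy hits: (query, target, fractional identity). Values are made up.
hits = [
    ("A|g1", "B|g1", 0.95),
    ("B|g1", "C|g1", 0.90),
    ("A|g2", "B|g2", 0.40),  # below threshold, stays unlinked
    ("A|g1", "A|g3", 0.80),  # within-genome hit: candidate paralog
]
clusters = cluster_genes(hits)
needs_splitting = [c for c in clusters if has_paralogs(c)]
```

Clusters flagged by `has_paralogs` are the ones a paralogy-splitting step would need to examine, for example by inspecting the gene tree of that cluster.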

---

# Syntenic alignment $\approx$ structural diversity
.left-column[
.center[
![:scale 250](/figs/basel/carb/syntenyCartoon.svg)
![:scale 250](/figs/basel/carb/syntenymatrix.svg)
]
.center[
Hierarchically cluster into "structural clades"
]
]

--
.right-column[
.center[
![:scale 300](/figs/basel/carb/kpc_synteny.svg)
]
]

.left-column[
* Syntenic changes resolve evolutionary relationships between plasmids
* Different $bla_{KPC}$ genes are found in same context
* Plasmids promiscuously shared across MLST and species
]
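
One way to turn gene order into "structural clades", as a sketch only. The distance here is one minus the Jaccard overlap of unsigned gene adjacencies, an assumed stand-in for the syntenic alignment score used in this work, and the single-linkage `cutoff` is arbitrary.

```python
def adjacencies(gene_order):
    """Unordered neighbouring gene pairs of a circular gene order."""
    n = len(gene_order)
    return {frozenset((gene_order[i], gene_order[(i + 1) % n])) for i in range(n)}

def synteny_distance(a, b):
    """1 - Jaccard overlap of adjacency sets; 0 means identical gene order."""
    adj_a, adj_b = adjacencies(a), adjacencies(b)
    union = adj_a | adj_b
    if not union:
        return 0.0
    return 1.0 - len(adj_a & adj_b) / len(union)

def structural_clades(plasmids, cutoff):
    """Single linkage: merge clusters while any cross pair is within cutoff."""
    clades = [[name] for name in sorted(plasmids)]
    merged = True
    while merged:
        merged = False
        for i in range(len(clades)):
            for j in range(i + 1, len(clades)):
                if any(synteny_distance(plasmids[x], plasmids[y]) <= cutoff
                       for x in clades[i] for y in clades[j]):
                    clades[i] = sorted(clades[i] + clades[j])
                    del clades[j]
                    merged = True
                    break
            if merged:
                break
    return clades

# Toy plasmids written as circular lists of gene-cluster labels (made up).
plasmids = {
    "p1": ["a", "b", "c", "d", "e"],
    "p2": ["c", "d", "e", "a", "b"],  # same circle, different start
    "p3": ["a", "b", "c", "d", "f"],  # one gene replaced
    "p4": ["x", "y", "z"],            # unrelated backbone
}
clades = structural_clades(plasmids, cutoff=0.6)
```

Because adjacencies are unsigned and circular, the choice of start position does not affect the distance, while a gene replacement or an inversion breaks two adjacencies.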

---

# Carbapenemases have varying signatures of HGT

.third1[
![:scale 300](/figs/basel/carb/kpc_synteny.svg)
]

.third2[
.center[
![:scale 250](/figs/basel/carb/ndm_synteny.svg)
]
]

.third3[
![:scale 240](/figs/basel/carb/oxa48_synteny.svg)
]

.block[
* $bla_{KPC}$: plasmid-bound; correlated w/ MLST and clone
* $bla_{NDM}$: high transposition rate; genome integration
* $bla_{OXA-48}$: high conjugation rate, low transposition rate
]

---

# Problems with this analysis

* Sample size is just large enough to get a qualitative sense of the rates, but not large enough to measure them quantitatively.
* Extreme sensitivity to annotation errors
* Syntenic alignment not a proportional measure of evolutionary events -- e.g. inversions

&nbsp;
&nbsp;

.center[The next section is very much a work in progress! Thoughts and general grumpiness are welcomed.]

---

# Scaling up to a global picture

Extend our dataset:
* Perform the same comparison against ![:emph](all) carbapenemase-carrying plasmids contained in the NCBI pathogen database.
* Compare against structural outgroup to estimate transposition

.center[$bla_{KPC}$]
.center[![:scale 360](/figs/basel/carb/kpc_global.png)]

.center[Most global structural "clades" are represented by our Basel sample.]

---

# Formalizing structural diversity as a graph
Generalize away from a fixed linear coordinate system to describe polymorphisms
* Each genome is represented as a closed path through a graph.
* Alignable regions are simply collinear paths.
* Better evolutionary distance measure than synteny alignment score.
* Structural variability of a particular locus = # paths.

.middle[.center[![:scale 1000](/figs/basel/carb/graph.png)]]
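
A minimal sketch of the graph picture, assuming each gene label appears once per genome and ignoring strand. The gene labels (`bb1`, `flankL`, `amr`, `ins`, ...) and both function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def build_graph(genomes):
    """Map each directed edge (gene, next gene) to the genomes traversing it."""
    edges = defaultdict(set)
    for name, order in genomes.items():
        for i, gene in enumerate(order):
            edges[(gene, order[(i + 1) % len(order)])].add(name)  # closed path
    return edges

def locus_paths(genomes, start, end):
    """Distinct walks from start to end, mapped to the genomes that use them."""
    paths = defaultdict(list)
    for name, order in genomes.items():
        if start not in order or end not in order:
            continue
        i = order.index(start)
        rotated = order[i:] + order[:i]  # rotate the circle to begin at start
        paths[tuple(rotated[: rotated.index(end) + 1])].append(name)
    return dict(paths)

# Toy circular genomes (made up): p3 carries an insertion inside the locus.
genomes = {
    "p1": ["bb1", "flankL", "amr", "flankR", "bb2"],
    "p2": ["bb2", "bb1", "flankL", "amr", "flankR"],
    "p3": ["bb1", "flankL", "ins", "amr", "flankR", "bb2"],
}
edges = build_graph(genomes)
paths = locus_paths(genomes, "flankL", "flankR")
n_paths = len(paths)  # structural variability of the locus
```

Edges shared by every genome mark collinear, alignable stretches, while the number of distinct walks between two shared flanks gives the "# paths" measure for that locus.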

---

# Future outlook

&nbsp;
Can we start to make theoretical inroads into basic questions regarding polymorphism at the molecular architecture level?
* How much variation in synteny should one expect given a quickly adapting molecule?
* Can we understand the statistics of the resultant structural trees? 
* How do rearrangement dynamics renormalize the statistics of the underlying gene tree?

&nbsp;

Complementary requirement: we need ![:emph](scalable) algorithms to deal with evolution in this limit.
* Multiple "plasmid" alignment in the face of structural rearrangements.
* Need a precise definition of a polymorphic degree of freedom to track.

---

# Acknowledgements
&nbsp;
&nbsp;

.twoThirdsLeft[![:scale 600](/figs/basel/carb/ackw.png)]

.block[
My collaborators
 * Eric Ulrich
 * Daniel Wurthrich
 * Vladimira Hinic
 * Adrian Egli
 * Richard Neher

You all for listening 
]