.center[
.vertical-center[
# Understanding the physics of living systems with advances in sequencing data

Nicholas Noll

Neher Lab

Biozentrum, University of Basel
]
]

---
class: middle, center

# Part 1: Observing bacterial pathogen evolution with long read sequencing

---

# Antibiotic resistance as an arms race

.center[]

---

# Carbapenem resistance is quickly spreading

.left-column[.center[]]
.right-column[.center[]]

.footnote[.cite[ECDC Rapid Risk Assessment: Carbapenem-resistant Enterobacteriaceae (2016)]]

--

.left-column[
#### Surveillance needed
* Drug reserved for multi-resistant bugs
* Resistance conferred by many genes
* Estimate rates to inform decisions
]

--

.right-column[
#### An uncontrolled evolution experiment
* Strong selective pressure for 3 decades
* Learn something fundamental
  - <sub><sup>e.g. Genes or genomes?</sup></sub>
]

---

# Organization of the resistant bacterial genome

.middle[.center[]]

---
count:false

# Organization of the resistant bacterial genome

.middle[.center[]]

---
count:false

# Organization of the resistant bacterial genome

.middle[.center[]]

---

# Sequences encode evolutionary history

.center[]

---
count:false

# Sequences encode evolutionary history

.center[]

---
count:false

# Sequences encode evolutionary history

.center[]

---
count:false

# Sequences encode evolutionary history

.center[]

---
count:false

# Sequences encode evolutionary history

.center[]

--

#### Resolving history hinges on
* Enough samples: resolve all _clades_

--

* Enough variation: 1 RNA virus mutates $10^{-5}$/site/day $\rightarrow 1$ fixed mutation/week

--

* Tree-like evolution

---

# Organization of the resistant bacterial genome

.middle[.center[]]

---
count:false

# Organization of the resistant bacterial genome

.middle[.center[]]

---
count:false

# Bacteria evolve by horizontal transfer

.left-column[]

.footnote[[1] .cite[Sakoparnig, T. et al. bioRxiv 2019] [2] .cite[Olivier, P.H. et al. Nat. Comm. 2017] [3] .cite[Anderson, R. et al.
Annu. Rev. Genet. 2009]]

--

.right-column[
#### Extensive homologous recombination
* Tree model of evolution not valid
  - <sub><sup>e.g. $\sim 25\%$ of mutations consistent w/ 1 tree<sup>[1]</sup></sup></sub>
]

--

.right-column[
#### Uptake of genes from environment
* Pan-genome
* Distributed or localized in genome?<sup>[2]</sup>
]

--

.right-column[
#### Genome reorganization
* Gene duplication & loss
  - <sub><sup>$10^{-4}$ /cell/gen<sup>[3]</sup></sup></sub>
* Transposable & conjugative elements
]

---

# Resolving genome organization requires long reads

.left-column[
#### Complete & accurate de-novo assembly is hard
* : high coverage, short reads
* Too short to bridge repetitive elements
* : Resistance genes flanked by repetitive elements
]

--

.right-column[.center[.middle[]]]

.left-column[.center[]]

---
count:false

# Resolving genome organization requires long reads

.left-column[
#### Complete & accurate de-novo assembly is hard
* : high coverage, short reads
* Too short to bridge repetitive elements
* : Resistance genes flanked by repetitive elements
]

.right-column[.center[.middle[]]]

.left-column[.center[]]

---
count:false

# Resolving genome organization requires long reads

.left-column[
#### Complete & accurate de-novo assembly is hard
* : high coverage, short reads
* Too short to bridge repetitive elements
* : Resistance genes flanked by repetitive elements
]

.right-column[.center[.middle[]]]

.left-column[.center[]]

---
count:false

# Resolving genome organization requires long reads

.left-column[
#### Complete & accurate de-novo assembly is hard
* : high coverage, short reads
* Too short to bridge repetitive elements
* : Resistance genes flanked by repetitive elements
]

.right-column[.center[.middle[]]]

.left-column[.center[]]

.footnote[[1] .cite[.url[github.com/rrwick]]]

---
count:false

# Resolving genome organization requires long reads

.left-column[
#### Complete & accurate de-novo assembly is hard
* : high coverage, short
reads
* Too short to bridge repetitive elements
* : Resistance genes flanked by repetitive elements
]

.right-column[.center[.middle[]]]

.left-column[.center[]]

.footnote[[1] .cite[.url[github.com/rrwick]]]

---
count:false

# Resolving genome organization requires long reads

.left-column[
#### Complete & accurate de-novo assembly is hard
* : high coverage, short reads
* Too short to bridge repetitive elements
* : Resistance genes flanked by repetitive elements
]

.right-column[
.center[
.middle[
#### Biozentrum ONT sequencing center
]
]
]

.left-column[.center[]]

.footnote[[1] .cite[.url[github.com/rrwick]]]

---
count:false

# Resolving genome organization requires long reads

.left-column[
#### Complete & accurate de-novo assembly is hard
* : high coverage, short reads
* Too short to bridge repetitive elements
* : Resistance genes flanked by repetitive elements
]

.right-column[.center[.middle[]]]

.left-column[.center[]]

.footnote[[1] .cite[.url[github.com/rrwick]]]

---
count:false

# Resolving genome organization requires long reads

.left-column[
#### Complete & accurate de-novo assembly is hard
* : high coverage, short reads
* Too short to bridge repetitive elements
* : Resistance genes flanked by repetitive elements
]

.right-column[.center[.middle[]]]

.left-column[.center[]]

.footnote[[1] .cite[.url[github.com/rrwick]]]

---
count:false

# Resolving genome organization requires long reads

.left-column[
#### Complete & accurate de-novo assembly is hard
* : high coverage, short reads
* Too short to bridge repetitive elements
* : Resistance genes flanked by repetitive elements
]

.right-column[.center[.middle[]]]

.left-column[.center[]]

.footnote[[1] .cite[.url[github.com/rrwick]]]

---
count:false

# Resolving genome organization requires long reads

.left-column[
#### Complete & accurate de-novo assembly is hard
* : high coverage, short reads
* Too short to bridge repetitive elements
* : Resistance genes flanked by repetitive
elements
]

.right-column[
#### Hybrid assembly is complete & accurate
.center[]
]

.left-column[.center[]]

.footnote[[1] .cite[.url[github.com/rrwick]]]

---
count:false

# Resolving genome organization requires long reads

.left-column[
#### Illumina accuracy
\\(n\\) = # of errors. \\(\phi\\) = coverage. \\(\rho(\phi)\\) is Poisson

`$$\mathcal{L}(n)= {\phi \choose n} p^n (1-p)^{\phi-n} \rho(\phi)$$`

.center[]
]

.right-column[
#### Hybrid assembly is complete & accurate
.center[]
]

.footnote[[1] .cite[.url[github.com/rrwick]]]

---

# 110 carbapenem-resistant bacteria in Basel over 7 years

.left-column[]
.left-column[]

--

* Isolated from patients at University Hospital
  - <sup><sub>Chosen for exhibiting a carbapenem-resistant phenotype</sub></sup>

--

* One of the largest sequencing studies of its kind to date
  - <sup><sub>Increased # of complete _K. pneumoniae_ genomes by $\sim 10\%$</sub></sup>
  - <sup><sub>Supplemented w/ $\sim 700$ complete carb. genomes available from GenBank</sub></sup>

---
count:false

# 110 carbapenem-resistant bacteria in Basel over 7 years

.left-column[]
.left-column[]

* Isolated from patients at University Hospital
  - <sup><sub>Chosen for exhibiting a carbapenem-resistant phenotype</sub></sup>
* One of the largest sequencing studies of its kind to date
  - <sup><sub>Increased # of complete _K. pneumoniae_ genomes by $\sim 10\%$</sub></sup>
  - <sup><sub>Supplemented w/ $\sim 700$ complete carb. genomes available from GenBank</sub></sup>

.center[**Use reconstruction of whole plasmids to study dynamics**]

---

# Resistance plasmids are not static

.center[]

.footnote[.cite[Peter, S. et al. Tracking of antibiotic resistance transfer and rapid plasmid evolution in a hospital setting by Nanopore sequencing.
bioRxiv 2019]]

--

#### Given our samples, we measure:
* How many plasmids each carbapenemase is found on
* Rate at which plasmids are transferred
* Rate at which plasmids exchange genes

---
count:false

# Current comparative techniques don't scale

.center[]

---

# Collect plasmids into a synteny graph

.center[]

---
count:false

# Collect plasmids into a synteny graph

.center[]

---
count:false

# Collect plasmids into a synteny graph

.center[]

---
count:false

# Collect plasmids into a synteny graph

.center[]

---
count:false

# Collect plasmids into a synteny graph

.center[]

---
count:false

# Collect plasmids into a synteny graph

.center[]

---
count:false

# Collect plasmids into a synteny graph

.center[]

---
count:false

# Collect plasmids into a synteny graph

.center[]

---

# Synteny graph constructed "bottom-up"

.twoThirdsLeft[]

* Construct a guide tree by k-mer similarity

---
count:false

# Synteny graph constructed "bottom-up"

.twoThirdsLeft[]

* Construct a guide tree by k-mer similarity
* Initialize leaves with singleton graphs

---
count:false

# Synteny graph constructed "bottom-up"

.twoThirdsLeft[]

* Construct a guide tree by k-mer similarity
* Initialize leaves with singleton graphs
* Align homologous regions of closest pair

---
count:false

# Synteny graph constructed "bottom-up"

.twoThirdsLeft[]

#### Pairwise alignment
* Sketch each block of graph into minimizers
  - <sub><sup>Random hash function $\varphi(k\text{-mer}) \to \mathbb{Z}$</sup></sub>
  - <sub><sup>Minimizer: min of $\varphi$ in a window of $w$ k-mers</sup></sub>
  - <sub><sup>Exact match of length $\geq w+k-1$ guarantees a shared minimizer</sup></sub>

--

* Seed alignment at equal minimizers

--

* Chain with dynamic programming

--

* Alternatively use Minimap2

--

* Accept based on information-theoretic criterion

.footnote[.cite[Roberts, M. et al. Reducing storage requirements for biological sequence comparison.
Bioinformatics (2004)]]

---
count:false

# Synteny graph constructed "bottom-up"

.twoThirdsLeft[]

* Construct a guide tree by k-mer similarity
* Initialize leaves with singleton graphs
* Align homologous regions of closest pair
* Iterate "up" tree passing graph towards root

---

# An interactive way to view multiple genome alignments

.center[
<video width="840" height="600" controls loop>
<source src="/vids/intro.mp4" type="video/mp4">
</video>
]

---

# Has evolution shaped the diversity of paths?

.middle[.center[]]

---

# The core "backbone" of a plasmid

.center[]

---
count:false

# The core "backbone" of a plasmid

.center[]

---
count:false

# The core "backbone" of a plasmid

.center[]

---

# Blocks encompass multiple genes

.left-column[.center[]]
.right-column[.center[]]

---

# Blocks are approximately clonal

.left-column[.center[]]
.right-column[.center[]]

---

# Count blocks as evolutionary events

.right-column[
Simple interplasmid distance: # breakpoints
.center[]
]

---
count:false

# Count blocks as evolutionary events

.left-column[
.center[]
.center[**Plasmids A/B**]
]

.right-column[
Simple interplasmid distance: # breakpoints
.center[]
]

---
count:false

# Count blocks as evolutionary events

.left-column[
.center[]
.center[**Plasmids A/C**]
]

.right-column[
Simple interplasmid distance: # breakpoints
.center[]
]

---
count:false

# Count blocks as evolutionary events

.left-column[
.center[]
.center[**Plasmids B/C**]
]

.right-column[
Simple interplasmid distance: # breakpoints
.center[]
]

---

# Structural tree captures history

.center[]

---
count:false

# Structural tree captures history

.center[]

---
count:false

# Structural tree captures history

.center[]

--

#### Repeat for other carbapenemases
* OXA-48 found on one plasmid; many different genomes
* NDM found on many plasmids; only transposon shared

---
count:false

# Structural tree captures history

.center[]

#### Repeat for other carbapenemases
* OXA-48 found on one plasmid; many different
genomes
* NDM found on many plasmids; only transposon shared

---

# Surveillance of resistance requires new approaches

--

.left-column[
.center[
.middle[
_Traditional approach insufficient due to HGT_
]
]
]

--

.right-column[
.middle[
.center[
_Long read sequencing to resolve genome_
]
]
]

--

.left-column[
.middle[
.center[
_Plasmid alignment infers rearrangements_
]
]
]

--

.right-column[
.middle[
.center[
_Structural comparison provides history_
]
]
]

---

# Acknowledgements

.twoThirdsLeft[]

.block[
My collaborators
* <sup><sub>Eric Ulrich</sub></sup>
* <sup><sub>Daniel Wurthrich</sub></sup>
* <sup><sub>Vladimira Hinic</sub></sup>
* <sup><sub>Adrian Egli</sub></sup>
* <sup><sub>Richard Neher</sub></sup>

You all for listening
]

---
class: middle, center

# Part 2: Positional information as dimensional reduction

---

# Development is the connection of heredity to form

.left-column[.center[]]

--

.right-column[.middle[.center[]]]

--

.right-column[.center[**Ultimate problem in pattern formation**]]

---

# Advances in scRNAseq permit new analyses

.middle[.center[]]

---

# How does a cell know where it is?
Hydrogen atom of morphological patterning: symmetry broken by a diffusible molecule

--

.left-column[.center[]]

--

.right-column[.center[]]

--

.center[**Morphogens provide positional information**]

---

# Patterning of the Drosophila AP axis is an example

.twoThirdsLeft[
#### Bcd instructs genetic networks to pattern AP fates
]

--

.third3[
#### Stripes form pupal segments
.middle[]
]

---

# Patterning of the Drosophila AP axis is an example

.twoThirdsLeft[
#### Bcd instructs genetic networks to pattern AP fates
]

.third3[
#### Stripes form pupal segments
.middle[]
]

---

# Positional information of gap gene profiles

Optically measure along midsagittal plane for many embryos

.center[]

--

Large gradients in expression allow for more positional information

`$$\sigma_x^{-2} = \displaystyle\sum_{i,j=1}^4 \partial_x \bar{g}_i \left(C^{-1}\right)_{ij} \partial_x \bar{g}_j$$`

.footnote[.cite[Dubuis, J. et al. Positional information, in bits. (2013) PNAS]]

---
count:false

# Positional information of gap gene profiles

Cells can determine their position to within $\sim 1\%$ of the AP axis

.center[]

Large gradients in expression allow for more positional information

`$$\sigma_x^{-2} = \displaystyle\sum_{i,j=1}^4 \partial_x \bar{g}_i \left(C^{-1}\right)_{ij} \partial_x \bar{g}_j$$`

.footnote[.cite[Dubuis, J. et al. Positional information, in bits.
(2013) PNAS]]

---

# Reimagine as a many-body problem

scRNAseq technology allows high-throughput analyses

.left-column[.center[]]

.center[]

---
count:false

# Reimagine as a many-body problem

scRNAseq technology allows high-throughput analyses

.left-column[.center[]]

.right-column[
**Have to solve the spatial inverse problem**
* Lose spatial position upon dissociation
* Use obtained expression data to map onto embryo
  - <sub><sup>Supervised: Align against database</sup></sub>
  - <sub><sup>Unsupervised: Maximize mutual info</sup></sub>
]

.center[]

---

# Supervised approach to spatial inference

.middle[.center[]]

---
count:false

# Supervised approach to spatial inference

#### Definitions
* $\alpha$ indexes genes: $\alpha \in \{1, 2, ..., G\}$
* $i$ indexes sequenced cells: $i \in \{1, 2, ..., N\}$
* $a$ indexes cells on embryo we map to: $a \in \{1, 2, ..., C\}$

--

#### Given a database of known gene expression patterns on embryo
* $\chi_{\alpha, a}$ denotes normalized expression pattern of database
* Utilize atlases from BDTNP <sup>1</sup>

.footnote[<sup>1</sup> .cite[Fowlkes, C. et al. Registering Drosophila Embryos at Cellular Resolution to Build a Quantitative 3D Atlas of Gene Expression Patterns (2008)]]

--

#### Match sequenced cells onto database with Optimal Transport
* $x_{\alpha, i}$ denotes normalized expression state of $i^{th}$ cell
* $\psi_{i, a}$ denotes the probability that cell $i$ maps to position $a$
* Find assignment that minimizes the degree of "mismatch"

--

`$$ F = -\displaystyle\sum\limits_{i, a, \alpha} x_{\alpha, i} \psi_{i, a} \chi_{\alpha, a} + T^{-1} \displaystyle\sum\limits_{i, a} \psi_{i, a} \log\psi_{i, a} $$`

---

# Drosophila expression at single cell resolution

.FiftyFiveLeft[]

#### Novel approach improves prediction
* $\sim 67\%$ agreement with in-situ database
* $\sim 10\%$ improvement over DistMap
* Continuous expression data.
Not bimodal

---
count:false

# Drosophila expression at single cell resolution

.FiftyFiveLeft[]

#### Novel approach improves prediction
* $\sim 67\%$ agreement with in-situ database
* $\sim 10\%$ improvement over DistMap
* Continuous expression data. Not bimodal

#### Requires good choice of database
* Strongly depends on number
* Gene expression not independent!

---
count:false

# Drosophila expression at single cell resolution

.FiftyFiveLeft[]

#### Novel approach improves prediction
* $\sim 67\%$ agreement with in-situ database
* $\sim 10\%$ improvement over DistMap
* Continuous expression data. Not bimodal

#### Requires good choice of database
* Strongly depends on number
* Gene expression not independent!

#### Study principles of expression patterns
* Computational prediction for $10^4$ genes
* Test mutual information hypothesis!

---

# Towards solving unsupervised problem

**Positional information hypothesis:** Gene expression should be a continuous field defined on a 2D manifold

--

.left-column[
.center[
Physically close cells are close in expression
]
]

---
count:false

# Towards solving unsupervised problem

**Positional information hypothesis:** Gene expression should be a continuous field defined on a 2D manifold

.left-column[.center[]]
.right-column[.center[]]

---

# Future directions

#### Formulate unsupervised problem
* Find assignment that maximizes positional precision using OT

`$$\sigma_x^{-2} = \displaystyle\sum_{i,j=1}^G \partial_x \bar{g}_i \left(C^{-1}\right)_{ij} \partial_x \bar{g}_j$$`

* Test hypothesis given our already mapped data
* Use manifold learning to find 2D surface in expression data

#### Big dream
* Gives us a unique assay for evo-devo style questions
* Comparative study of gene expression profile across many _Drosophila_ species
* Can we use as a test for when genes  positional information?
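
---

# Sketch: minimizing $F$ with Sinkhorn iterations

The free energy $F$ from the supervised mapping slide is an entropy-regularized optimal transport objective. What follows is a minimal sketch, not the implementation behind the results shown. It assumes marginal constraints that the slides do not state: each sequenced cell spreads unit probability over positions, and each position receives the same total weight $N/C$. Under those assumptions the minimizer has the form $\psi_{i,a} = u_i \exp(T \sum_\alpha x_{\alpha,i} \chi_{\alpha,a}) v_a$, and the scaling vectors $u$, $v$ can be found by alternating normalization (Sinkhorn iterations). The name `map_cells`, the toy data, and the defaults for `T` and `n_iter` are hypothetical.

```python
import math


def map_cells(x, chi, T=5.0, n_iter=200):
    """Entropy-regularized assignment of N cells to C embryo positions.

    x[alpha][i]   : normalized expression of gene alpha in sequenced cell i
    chi[alpha][a] : normalized database expression of gene alpha at position a
    Returns psi[i][a]; each row sums to 1.
    """
    G, N, C = len(x), len(x[0]), len(chi[0])
    # Overlap between cell i and position a: sum_alpha x[alpha][i] * chi[alpha][a]
    overlap = [[sum(x[g][i] * chi[g][a] for g in range(G)) for a in range(C)]
               for i in range(N)]
    # Stationarity of F under the marginal constraints gives
    # psi[i][a] = u[i] * exp(T * overlap[i][a]) * v[a].
    # Subtracting the global maximum only rescales u and avoids overflow.
    top = max(max(row) for row in overlap)
    K = [[math.exp(T * (s - top)) for s in row] for row in overlap]
    u, v = [1.0] * N, [1.0] * C
    col_mass = N / C  # assumption: every position receives equal total weight
    for _ in range(n_iter):
        # Rescale columns to the assumed position weight, then rows to 1.
        v = [col_mass / sum(K[i][a] * u[i] for i in range(N)) for a in range(C)]
        u = [1.0 / sum(K[i][a] * v[a] for a in range(C)) for i in range(N)]
    return [[u[i] * K[i][a] * v[a] for a in range(C)] for i in range(N)]


# Toy data: 2 genes, 2 cells, 2 positions.
x = [[1.0, 0.0],    # gene 0 is on in cell 0 only
     [0.0, 1.0]]    # gene 1 is on in cell 1 only
chi = [[0.0, 1.0],  # gene 0 is expressed at position 1
       [1.0, 0.0]]  # gene 1 is expressed at position 0
psi = map_cells(x, chi)
for row in psi:
    print([round(p, 3) for p in row])
```

In the toy example cell 0 lands mostly on position 1 and cell 1 on position 0. Raising `T` sharpens the assignment towards a one-to-one matching, while lowering it spreads each cell over more positions.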