.center[
.vertical-center[
# Computational techniques to analyze pangenome evolution
Nicholas Noll

Kavli Institute for Theoretical Physics, UCSB
]
]
---
# Sequences encode evolutionary history
.center[]
???
* The assumption of vertical inheritance underpins most of our computational techniques
* Implicit assumption that mutational distance is a proxy for the passage of time, and thus for relatedness structure
* Proven immensely useful in the era of COVID-19
---
# Sequences encode evolutionary history
.center[]
.center[Polymorphisms defined relative to sequence alignment]
???
* However, all our techniques start from the assumption of a linear alignment with SNPs
* Easy to define a distance metric!
* Only really holds for viral evolution (and there only for a single segment!)
---
# Microbial evolution is different
.left-col50[
.vertical-center[
.center[ ]
]
]
--
.right-col50[
#### Evolution experiment w/ Bacillus subtilis<sup>1</sup>
* HGT events transfer $\approx 5$ genes
* Recombination changes $\approx 100$ bp/gen
* Mutation changes $\approx 10^{-1}$ bp/gen
]
.footnote[<sup>1</sup>.cite[Power et al. PNAS. (2021)]]
???
### Left side
* Well appreciated at this point that bacterial evolution does not fit this computational model
* A plethora of phenomena demonstrate that bacteria share DNA horizontally with their community
* We don't know the rates quantitatively
### Experimental
* Very recent studies have tried to estimate the "contribution" of HGT + HR to bacterial evolution
* Michael Lassig's group: 2 strains, donor (2A9) and acceptor (BD630)
* 2-day cycle with 6 steps: dilution, radiation, plating, colony selection, competence induction + donor DNA, overnight growth
* Natural competence at high densities.
They ectopically drive it by controlling the master regulator
* Sequence with Illumina and count how often donor-strain DNA shows up
---
count:false
# Microbial evolution is different
.left-col50[
.vertical-center[
.center[ ]
]
]
.right-col50[
#### Evolution experiment w/ Bacillus subtilis<sup>1</sup>
* HGT events transfer $\approx 5$ genes
* Recombination changes $\approx 100$ bp/gen
* Mutation changes $\approx 10^{-1}$ bp/gen
#### Wild E. coli isolates<sup>2</sup>
* Phylogeny changes every $\approx 100$ bp
* Evolution $\ne$ clonal
]
.footnote[<sup>1</sup>.cite[Power et al. PNAS. (2021)] <sup>2</sup>.cite[Sakoparnig, F. et al. eLife (2021)]]
???
### Look in nature
* Erik van Nimwegen's group looked at wild E. coli isolates
* 91 E. coli strains isolated from the same habitat
* Focused just on the core genome
* Estimated recombination using the four-gamete test on biallelic loci
* For close pairs, bimodal distribution in SNP density
* Low density = clonal. High = HR. The weight of the distribution in each mode lets you estimate the clonal fraction
* Estimate that by the time you hit 1% divergence, all clonal signature has been erased
* Even for close strains where the clonal fraction is large, most substitutions are still introduced by HR
---
count:false
# Microbial evolution is different
.left-col50[
.vertical-center[
.center[ ]
]
]
.right-col50[
#### Evolution experiment w/ Bacillus subtilis<sup>1</sup>
* HGT events transfer $\approx 5$ genes
* Recombination changes $\approx 100$ bp/gen
* Mutation changes $\approx 10^{-1}$ bp/gen
#### Wild E. coli isolates<sup>2</sup>
* Phylogeny changes every $\approx 100$ bp
* Evolution $\ne$ clonal
#### Comparative genomics
* Modern notion of pangenome
* Large gene variation across isolates of same species
]
.footnote[<sup>1</sup>.cite[Power et al. PNAS. (2021)] <sup>2</sup>.cite[Sakoparnig, F. et al. eLife (2021)]]
???
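The four-gamete test mentioned for the wild E. coli isolates can be sketched as follows. This is a minimal illustration, not the pipeline of the cited study: it assumes an ungapped core-genome alignment given as equal-length strings, and the function names and toy alignment are hypothetical. Under an infinite-sites assumption (no recurrent mutation), a pair of biallelic sites that shows all four allele combinations cannot be explained by a single tree, so it signals recombination.

```python
from itertools import combinations


def four_gamete_violation(site_a, site_b):
    """Return True if two sites show all four allele combinations."""
    gametes = set(zip(site_a, site_b))
    return len(gametes) == 4


def incompatible_pairs(alignment):
    """List pairs of biallelic columns that fail the four-gamete test.

    alignment: list of equal-length strings, one per strain (assumed ungapped).
    """
    columns = list(zip(*alignment))
    # Keep only columns with exactly two alleles.
    biallelic = [i for i, col in enumerate(columns) if len(set(col)) == 2]
    return [
        (i, j)
        for i, j in combinations(biallelic, 2)
        if four_gamete_violation(columns[i], columns[j])
    ]


# Hypothetical toy alignment: 4 strains, 3 sites.
toy = ["AAA", "ATA", "TAA", "TTT"]
print(incompatible_pairs(toy))  # [(0, 1)]
```

Counting how far one can walk along the alignment before hitting an incompatible pair gives a rough feel for the length scale over which the phylogeny changes.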
### Pangenome
* Expanding our view beyond HR, we know HGT must be prevalent from the many comparative studies that exist
* Look at a cross section of the genes found in isolates of any given species and you see a diverse set that keeps expanding as you look at more isolates

Taken all together, it's very hard to infer or even summarize the evolutionary relatedness of a set of bacterial sequences! To help think about this problem, we look for a concrete "model" system
---
# Case study: Carbapenem resistance is quickly spreading
.left-col50[.center[]]
.right-col50[.center[]]
.footnote[.cite[ECDC Rapid Risk Assessment: Carbapenem-resistant Enterobacteriaceae (2016)]]
???
### Proposal: study the evolution of antibiotic resistance
* ARGs are thought to grow in prevalence primarily through HGT
* Study a particularly worrying variant of AR -> resistance to last-line-of-defense antibiotics
* Here I show a scare plot of the rapid increase in AR in Europe over half a decade
* Colors bin the estimated fraction of clinical isolates that carry a gene conferring resistance
* Goes up to $\sim 10\%$
--
.left-col50[
#### Surveillance needed
* Drug reserved for multi-resistant bugs
* Resistance conferred by many genes
* Estimate rates? Track evolution?
]
???
* Understanding this particular case has a lot of utility
* Would be nice to have an analog of COVID-19 variant tracking
--
.right-col50[
#### An uncontrolled evolution experiment
* Strong selective pressure for 3 decades
* Learn something fundamental
  - <sub><sup>e.g. Genes or genomes?</sup></sub>
]
???
* Interesting dataset theoretically
* Qualitatively know the timescales involved
* Potential access to longitudinal data
---
# 110 carbapenem-resistant bacteria in Basel over 7 years
.left-col50[]
.left-col50[]
???
* We undertook a (at the time) large-scale sequencing project
--
* Isolated from patients at University Hospital
  - <sup><sub>Chosen based on exhibiting carb.
resistant phenotype</sub></sup>
--
* At the time, one of the larger sequencing studies
  - <sup><sub>Increased # of complete K. pneumoniae genomes in RefSeq by $\sim 10\%$</sub></sup>
???
* Clinician Adrian Egli had been freezing AR isolates for the past decade
* Each isolate was phenotyped for resistance
* We systematically went through the list to find carbapenem-resistant isolates
---
count:false
# 110 carbapenem-resistant bacteria in Basel over 7 years
.left-col50[
.center[ ]
]
.right-col50[
.center[ ]
]
* Isolated from patients at University Hospital
  - <sup><sub>Chosen based on exhibiting carb. resistant phenotype</sub></sup>
* At the time, one of the larger sequencing studies
  - <sup><sub>Increased # of complete K. pneumoniae genomes in RefSeq by $\sim 10\%$</sub></sup>
* High quality assemblies
  - <sup><sub>Resolve genomic context</sub></sup>
???
* Undertook a hybrid approach, combining Illumina and ONT sequencing
* Critical, as Illumina assemblies often break on repetitive elements
* Shown on the left are the short-read assembly contigs mapped onto our hybrid assembly
* Carbapenemases are often found in transposons flanked by repeats
* Shown on the right is the contig length for each assembly approach: long reads are required for genomic context
* Note you can also see that most of the carbapenemases are found on plasmids here: 60%
---
count:false
# 110 carbapenem-resistant bacteria in Basel over 7 years
.center[ ]
* Isolated from patients at University Hospital
  - <sup><sub>Chosen based on exhibiting carb. resistant phenotype</sub></sup>
* At the time, one of the larger sequencing studies
  - <sup><sub>Increased # of complete K. pneumoniae genomes in RefSeq by $\sim 10\%$</sub></sup>
* High quality assemblies
  - <sup><sub>Resolve genomic context</sub></sup>
  - <sup><sub>Nucleotide accuracy</sub></sup>
???
* Long reads are still not enough: sequencing errors create a lot of false genes
* On the left, our error rates are consistent with Illumina on the vast majority of the assembly
* Our gene lengths are strongly peaked at their expected values according to UniProt
* No coverage anomalies
---
# Short dynamics dominated by HGT
.left-col50[
#### Kleb core gene tree
.center[ ]
]
???
* Play the usual game: find the core genome and build a "phylogeny" based on SNPs
--
.right-col50[
#### Gene P/A vs SNP distance
.center[ ]
]
???
* Can ask whether the presence/absence of accessory genes tracks the phylogenetic distance
* Sublinear scaling!
* Moreover, even very close pairs of isolates differ by $\sim 100$ genes
* Conclude: on short timescales (compared to the mutation fixation time), HGT dominates genome evolution
--
.right-col50[
* Use HGT as a short-range clock?
* How to count "structural" mutation?
]
???
* Is there any clock-like nature to HGT that we can use to infer evolutionary distances?
* What distance metric do we use between sequences?
---
count:false
# Focus analysis on plasmids
.left-col50[
#### Infer evolutionary relationships?
.center[ ]
]
???
* Explore these questions in the context of plasmids
* As already stated, the majority of our genes of interest fall on such mobile elements
* First thought is to repeat what we did for the genome, but for plasmids
* Run into the question of how to define a plasmid "species"
* A sensible place to start is the conserved replication machinery, i.e. the incompatibility group
--
.right-col50[
#### History poorly resolved by SNPs <sup>1</sup>
.center[ ]
]
.footnote[<sup>1</sup>.cite[David, Sophia. et al. PNAS (2020)] ]
???
* However, such a plasmid phylogeny is typically poorly resolved
* Here I'm showing you a tree for one particular family of plasmids from a very recent large-scale study in Europe.
* Our results are pretty much the same (although our trees are much smaller)
* We only have a handful of SNPs to distinguish plasmids
* Same story for different plasmid families
* Great opportunity to explore orthogonal distance metrics between sequences
* Instead, we are going to run with our postulated clock-like synteny changes
---
# Genes as a coarse-grained unit
* Assume plasmid variation on short timescales occurs through small changes in gene synteny.
--
* Must computationally recognize orthologous gene clusters in our sample.

.center[##### panX Algorithm<sup>1</sup>]
.left-col33[
.center[ Align all ORF pairs w/ DIAMOND ]
]
--
.center-col33[
.center[ MCL clustering ]
]
.right-col33[
.center[ Paralogy splitting ]
]
.footnote[<sup>1</sup>.cite[Ding, W. et al. panX: pan-genome analysis and exploration ] ]
---
# Synteny difference as proxy for evolutionary distance
.vertical-center[
.center[ ]
]
---
count:false
# Synteny difference as proxy for evolutionary distance
.vertical-center[
.center[ ]
]
---
count:false
# Synteny difference as proxy for evolutionary distance
.vertical-center[
.center[ ]
]
---
count:false
# Synteny difference as proxy for evolutionary distance
.vertical-center[
.center[ ]
]
---
# Structural tree captures history
#### All plasmids with KPC
.center[ ]
---
count:false
# Structural tree captures history
#### All plasmids with KPC
.center[ ]
---
count:false
# Structural tree captures history
#### All plasmids with KPC
.center[ ]
--
#### Hypothesis
* Split between the red and green clades is dominated by the insertion of a large stretch of DNA.
* Insertion carries multiple additional AMR genes
* Red clade tightly associated with the globally dominant AR Kleb.
* Causal?
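???
The idea of a synteny distance between plasmids can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the metric behind the structural trees shown here: each plasmid is taken to be a circular list of orthologous gene-cluster labels, strand is ignored, and the distance is one minus the Jaccard similarity of the gene-adjacency sets. The function names and gene labels are hypothetical.

```python
def adjacencies(gene_order):
    """Unordered neighbor pairs of a circular gene order (strand ignored)."""
    n = len(gene_order)
    return {
        frozenset((gene_order[i], gene_order[(i + 1) % n]))
        for i in range(n)
    }


def synteny_distance(a, b):
    """1 - Jaccard similarity of the two adjacency sets (assumed metric)."""
    adj_a, adj_b = adjacencies(a), adjacencies(b)
    union = adj_a | adj_b
    if not union:
        return 0.0
    return 1.0 - len(adj_a & adj_b) / len(union)


# Hypothetical gene-cluster labels; p2 is p1 with a single gene inserted.
p1 = ["repA", "traI", "blaKPC", "tnpA"]
p2 = ["repA", "traI", "aadA", "blaKPC", "tnpA"]

print(synteny_distance(p1, p2))               # 0.5
print(synteny_distance(p1, p1[2:] + p1[:2]))  # 0.0, rotation of a circle
```

A pairwise matrix of such distances could then be fed to any distance-based tree builder. Note that a single insertion already breaks one adjacency and creates two, which is one reason the measure is not proportional to the number of events.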
---
# Structural tree for different molecules
.left-col50[
.center[
##### OXA-48
]
]
.right-col50[
.center[
##### NDM
]
]
--
#### Suggests heterogeneous pattern
* KPC found on few plasmids tightly associated with different genomes
* OXA-48 found on few plasmids across many different genomes
* NDM found on many plasmids with only the transposon shared
* Corroborated by subsequent studies<sup>1</sup>

.footnote[<sup>1</sup>.cite[David, Sophia. et al. PNAS (2020)] ]
---
# Can we generalize technique to genomes?
* Slow: limited to small sample sizes. Only enough to get a qualitative sense of HGT rates
--
* Extreme sensitivity to annotation errors
--
* Not a proportional measure of evolutionary events, e.g. inversions & transpositions

.center[ ]
---
# Formalize structural diversity as a pangraph
Generalize away from a reference linear coordinate system
* Each genome is represented as a closed path through a graph
--
* Homologous regions between species are collinear paths
--
* Horizontal transfers $\approx$ # of paths passing through gene
--
* Evolutionary distance $\approx$ # of breakpoints (vertices)

.center[ ]
---
# Computational approach to build pangraph
.left-col66[ ]
.right-col33[
#### High-level algorithm
* Build guide tree from kmer distance
]
---
count:false
# Computational approach to build pangraph
.left-col66[ ]
.right-col33[
#### High-level algorithm
* Build guide tree from kmer distance
* Isolates attach to leaves as trivial graphs
]
---
count:false
# Computational approach to build pangraph
.left-col66[ ]
.right-col33[
#### High-level algorithm
* Build guide tree from kmer distance
* Isolates attach to leaves as trivial graphs
* Merge graphs postorder on tree
]
---
count:false
# Computational approach to build pangraph
.left-col66[ ]
.right-col33[
#### High-level algorithm
* Build guide tree from kmer distance
* Isolates attach to leaves as trivial graphs
* Merge graphs postorder on tree
* Merge pairwise between pancontigs based on minimizers
]
---
count:false
# Computational approach to build pangraph
.left-col66[ ]
.right-col33[
#### High-level algorithm
* Build guide tree from kmer distance
* Isolates attach to leaves as trivial graphs
* Merge graphs postorder on tree
* Merge pairwise between pancontigs based on minimizers
]
---
count:false
# Computational approach to build pangraph
.left-col66[ ]
.right-col33[
#### High-level algorithm
* Build guide tree from kmer distance
* Isolates attach to leaves as trivial graphs
* Merge graphs postorder on tree
* Merge pairwise between pancontigs based on minimizers
* Genome alignment is pulled from the root
]
---
# Preliminary results
.left-col50[
#### PanX synteny
]
---
count: false
# Preliminary results
.left-col50[
#### Pangraph
]
--
.right-col50[
#### Pancontig distribution
]
---
# Preliminary results
.left-col50[
#### PanX synteny
]
---
count: false
# Preliminary results
.left-col50[
#### Pangraph
]
--
.right-col50[
#### Pancontig distribution
]
---
# Preliminary results
.left-col50[
#### PanX synteny
]
---
count: false
# Preliminary results
.left-col50[
#### Pangraph
]
--
.right-col50[
#### Pancontig distribution
]
---
# Future outlook
.left-col50[
#### How to best interpret
* Empirical:
  - <sub><sup>Estimate the number of events between pairs</sup></sub>
  - <sub><sup>Connect to population genetic parameters</sup></sub>
  - <sub><sup>Do the statistics capture many-body information?</sup></sub>
]
.right-col50[ ]
--
.left-col50[
* Modeling:
  - <sub><sup>Null expectation for the structure of the graph?</sup></sub>
  - <sub><sup>Does the graph affect the genealogy of pancontigs?</sup></sub>
]
---
# Acknowledgements
.left-col33[
.center[ Richard Neher ]
]
.left-col33[
.center[ Marco Molari ]
]
.left-col33[
.center[ Adrian Egli ]
]
.center[To you all for listening!]
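---
# Backup: guide tree and merge order

The first three steps of the high-level algorithm (guide tree from kmer distance, isolates at the leaves, postorder merging) can be sketched as below. This is a toy illustration, not the pangraph implementation: the Jaccard kmer distance, the greedy clustering that pools kmer sets, and all function names are assumptions, and the minimizer-based pairwise merge is left out entirely.

```python
from itertools import combinations


def kmers(seq, k=4):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def guide_tree(seqs, k=4):
    """Greedy agglomerative tree as nested tuples of isolate names.

    Assumption: merged clusters are represented by their pooled kmer set.
    """
    nodes = [(name, kmers(seq, k)) for name, seq in sorted(seqs.items())]
    while len(nodes) > 1:
        # Pick the closest pair of current clusters.
        i, j = min(
            combinations(range(len(nodes)), 2),
            key=lambda p: jaccard_distance(nodes[p[0]][1], nodes[p[1]][1]),
        )
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] | nodes[j][1])
        nodes = [n for idx, n in enumerate(nodes) if idx not in (i, j)]
        nodes.append(merged)
    return nodes[0][0]


def merge_order(tree):
    """Internal nodes in postorder: children are merged before parents."""
    order = []

    def visit(node):
        if isinstance(node, tuple):
            left, right = node
            visit(left)
            visit(right)
            order.append(node)

    visit(tree)
    return order


# Hypothetical toy sequences.
seqs = {"a": "ACGTACGTAC", "b": "ACGTACGTAG", "c": "TTTTGGGGCC"}
tree = guide_tree(seqs)
print(merge_order(tree))  # [('a', 'b'), ('c', ('a', 'b'))]
```

Each entry of the merge order is where a pairwise graph merge would happen, so the most similar isolates are combined first and the root holds the graph for all genomes.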