.center[ .vertical-center[ # Learning the geometry of expression in early Drosophila embryogenesis Nicholas Noll Kavli Institute for Theoretical Physics, Santa Barbara, USA ] ] ??? * Hello everyone! * As the slide suggests, my name is Nicholas Noll, currently a postdoctoral fellow at the KITP in sunny Santa Barbara. * I am happy and excited to be here to tell you a bit about my recent, unpublished work looking into how positional information is encoded into the transcriptome of early development. * I want to emphasize that questions and interruptions are welcome and strongly encouraged! --- # Introduction .center-col33[ .center[ ![:scale 330](/figs/devbio/morpho/inference_embryo.svg) Mechanics of development ] ] ??? * I thought it would be nice just to give you a brief overview of my past work before diving into the most recent project. * I started my journey into Biology during my PhD work, which primarily centered on understanding/inferring the forces at play during early morphogenesis. * As it's hard to measure forces experimentally, I formulated a model and subsequently an inference algorithm to infer the stress tensor based upon the observed cellular geometry. -- .center-col33[ .center[ ![:scale 330](/figs/basel/timetree.png) Pathogen evolution ] ] ??? * My first postdoc was a departure from developmental biology. * I joined the group of Richard Neher to study microbial evolution. * Launched a sequencing study in collaboration with the local clinicians to de novo assemble antibiotic resistant bacterial genomes. * Wanted to understand/infer patterns of HGT in the context of ARG evolution. * In the process, wrote an algorithm to align whole genomes into a graph based data structure. -- .right-col33[ .center[ ![:scale 190](/figs/devbio/seqspace/encoder-rotated.svg) scRNAseq of development ] ] ??? * Finally, here at the KITP I have embarked on a project that has combined my experience with NGS and developmental biology. * This is the story I plan to speak on today. 
--- # Acknowledgements .center[ ![:scale 700](/figs/devbio/seqspace/acknowledgements.svg) ] ??? * I want to start by acknowledging my collaborators for this project. * They have been incredibly helpful bouncing ideas off of and proposing new experiments to test the ideas. --- # How do cells know where they are in space? #### Focus on Drosophila embryogenesis .left-col50[ .center[ Early Drosophila development ![:scale 440](/figs/devbio/drosophila/embryo.png) ] ] ??? * A central question in developmental biology considers the nature of how cells can "know" where they are in space. * More pointedly let's focus on the model organism, Drosophila melanogaster (fruit fly), shown here at ~2 hours PF. * Roughly 6000 cells within a simple epithelial monolayer that covers the surface of the yolk. * The embryo was caught just at the onset of gastrulation: can be seen by the initial formation of the cephalic furrow and posterior midgut. * Belly (ventral) on the bottom, back (dorsal) at top. Head (anterior) to the left, tail (posterior) to the right. * Reframe the general question specifically: how does the embryo break head/tail symmetry? -- .right-col50[ .center[ Morphogens provide positional information ![:scale 410](/figs/devbio/drosophila/bicoid.png) ] ] ??? * Answer: the mother breaks it! * Nobel prize winning work from Eric Wieschaus and Christiane Nüsslein-Volhard identified maternal mRNAs that are deposited into the head (bicoid)/tail (nanos) pole of each egg. * These mRNAs diffuse and degrade (are used to build proteins by the nuclei in the blastula), setting up an "exponential"/monotonic profile in space. * In green I am showing Bicoid, the first identified developmental morphogen. As named, if knocked out in the mother, will result in an embryo that forms with two tails. * Strongly binary phenotype. * Cells can "measure" distance from head pole by measuring concentration. --- count:false # How do cells know where they are in space? 
#### Focus on Drosophila embryogenesis .left-col50[ .center[ French flag model ![:scale 440](/figs/devbio/drosophila/frenchflag.png) ] ] .right-col50[ .center[ Morphogens provide positional information ![:scale 410](/figs/devbio/drosophila/bicoid.png) ] ] ??? * Bicoid itself is a transcription factor that activates many downstream targets. * "Measurement" is often conceptualized in a breakpoint model, i.e. the French flag of development. * Simple algorithm for fate specification: if concentration is greater than a cutoff, adopt fate A, else fate B. * Here, in blue, I show hunchback, which plausibly follows the paradigm. --- count:false # How do cells know where they are in space? #### Focus on Drosophila embryogenesis .left-col50[ .center[ French flag model ![:scale 440](/figs/devbio/drosophila/frenchflag.png) ] ] .right-col50[ .center[ Positional information informs fate ![:scale 495](/figs/devbio/drosophila/ap_signaling_cascade.png) ] ] ??? * The actual world is more complicated! * The result of Eric and Christiane's work was the elucidation of the entire AP patterning system, shown here. * Conceptualized as a hierarchical transduction network: bcd -> gap genes -> pair rule genes -> segment polarity * Broad range positional information is refined into fine, reproducible patterns of cell fate. * Segments directly correspond to future larval/pupal body segments. --- count:false # How do cells know where they are in space? #### Theory of Positional information .left-col50[ .center[ L. Wolpert. Positional Information and Pattern Formation in Development ![:scale 500](/figs/devbio/seqspace/wolpert_positional_information.png) ] ] ??? * The progenitor of the "discrete" framework of positional information is Lewis Wolpert. * All of morphogenesis and pattern formation can be conceptualized as a system that intercalates positional values into a smooth function. * Proposed cells acquire positional information with respect to discrete boundaries, rather than a true position. 
* Succinctly: Position is an integer not a real number. -- .right-col50[ .center[ ![:scale 500](/figs/devbio/seqspace/bialek_positional_info_header.png) ![:scale 450](/figs/devbio/seqspace/bialek_positional_info_fig.png) ] * Positional information is constant * Determined by gap genes ] ??? * An updated, quantitative take on the same system was published a decade ago from Bialek et al. * Immunofluorescence antibody staining of the 4 gap genes taken simultaneously. * Focused on expression along the AP axis of the midsagittal plane. * Measured across ensemble of embryos, compute the mutual information between the 4 gap genes and AP position. * Claim positional information is constant and specifies each cell's identity up to a cell size. * Real number, not integer! --- # How is space encoded in the early transcriptome? #### Focus on Drosophila embryogenesis .left-col50[ .center[ Transcriptome as spatial function? ![:scale 440](/figs/devbio/drosophila/embryo.png) ] ] .right-col50[ .center[ ![:scale 400](/figs/devbio/seqspace/pointcloud.svg) ] ] ??? * I want to reformulate this question in the age of scRNAseq. * The thought experiment I wish to perform is to go back to the original Drosophila embryo at the onset of gastrulation. * Imagine we could measure the gene expression for each cell in the epithelial monolayer. * Collect into a point cloud. --- count:false # How is space encoded in the early transcriptome? #### Focus on Drosophila embryogenesis .left-col50[ .center[ Transcriptome as spatial function? ![:scale 440](/figs/devbio/drosophila/embryo.png) ] ] .right-col50[ .center[ ![:scale 400](/figs/devbio/seqspace/pointcloud_surface.svg) ] ] ??? * If there is a true "mean" gene expression that is a function of spatial coordinates, then the point cloud expression data can be best thought of as sampled from a low-dimensional manifold. * When viewed this way, it becomes almost a canonical dimensionality reduction problem. 
--- count:false # How is space encoded in the early transcriptome? #### Focus on Drosophila embryogenesis .left-col50[ .center[ Continuum transcriptome ansatz ![:scale 440](/figs/devbio/drosophila/embryo_AP.png) ] ] .right-col50[ .center[ ![:scale 400](/figs/devbio/seqspace/pointcloud_surface_coordinates.svg) ] ] ??? * Moreover, we make a strong ansatz: the primary directions of this manifold _are_ space! * Thus, if this is indeed true, we should be able to de novo discover the AP axis from the distribution of the expression point cloud. * Equivalently, we should be able to "infer" space from this surface. --- count:false # How is space encoded in the early transcriptome? #### Focus on Drosophila embryogenesis .left-col50[ .center[ Pose as inference problem ![:scale 440](/figs/devbio/drosophila/embryo.png) ] ] .right-col50[ .center[ ![:scale 400](/figs/devbio/drosophila/scrnaseq.png) ] ] ??? * This is indeed how we pose the problem: given a bag of cells with sequenced transcriptomes obtained from scRNAseq, can we infer where they were sampled from on the original embryo? --- # Source of data .center[ ![:scale 650](/figs/devbio/seqspace/drosophila_single_cell_resolution_header.png) ] ??? * All data discussed today was obtained from public NCBI GEO repositories. * Generated from the 2017 Drop-seq study shown here. --- # Three components to our approach .left-col33[ .center[ **scRNAseq normalization** ![:scale 200](/figs/devbio/seqspace/normalize/method_splice.png) ] ] ??? * Time permitting, I hope to discuss the three main pillars of this study with you today. * The first component is a novel normalization scheme based upon random matrix theory. * Linear transformation of the data that stabilizes the variance, unlike problematic log normalization. -- .center-col33[ .center[ **Supervised inference** ![:scale 350](/figs/basel/dev/eve.png) *Predicted Eve expression* ] ] ??? * Another component will outline how to match scRNAseq data to a known database. 
* I label this section as a "supervised" inference. * Generate a set of "known" labels to subsequently check our learned manifold against. -- .right-col33[ .center[ **Unsupervised inference** ![:scale 190](/figs/devbio/seqspace/encoder-rotated.svg) ] ] ??? * Lastly, I will describe our novel approach to manifold learning in the context of scRNAseq data. * Autoencoder architecture regularized by constrained pairwise distance. * With that said, let's dive in! --- # scRNAseq data requires normalization #### Numerous sources of technical noise .left-col33[ .center[ ![:scale 330](/figs/devbio/seqspace/normalize/depth.png) Large variation across sequencing depth/cell ] ] ??? * scRNAseq data critically requires preprocessing steps to enable downstream analysis with minimal technical artifacts. * These preprocessing steps are an attempt to "correct" for technical noise in the sequencing process. * Show up as features in the data. * Heterogeneous sequencing depth owing to variable reaction efficiencies. * Spans 2 decades -- .center-col33[ .center[ ![:scale 330](/figs/devbio/seqspace/normalize/gene_expression.png) Large variation across genes (expression + efficiency) ] ] ??? * Heterogeneous expression count across genes (real and technical). * Spans 4 decades -- .right-col33[ .center[ ![:scale 330](/figs/devbio/seqspace/normalize/gene_dropout.png) Dropout effects ] ] ??? * Overdispersed! * Shows up as a nonlinear mean vs variance relationship and an overinflation of zero counts relative to Poisson expectation. * Coined "dropout". -- Biases are ![:emph](rectified) by a process called ![:emph](normalization). -- We build upon ![:emph](previous likelihood) methods. --- # Our general model of scRNAseq counts Denote measured expression of gene $i$ within cell $\alpha$ as $n_{i\alpha}$ #### Model as: .center[![:scale 600](/figs/devbio/seqspace/normalize/matrix_cartoon.png)] $$ n\_{i\alpha} = \mu\_{i\alpha} + \delta\_{i\alpha} $$ ??? * Roman letters index genes (rows). 
Greek index cells (columns). * Critically we assume that the true count matrix is "low" rank compared to the size of the full count matrix. * Can view this as a linearized version of our positional manifold ansatz. * The full rank of the measured count matrix then comes from the technical noise $\delta$. -- #### Our goal: * Estimate the low-rank mean $\mu\_{i\alpha}$ * Estimate the measurement variance $\langle \delta\_{i\alpha}^2 \rangle$ ??? * Normalization can be viewed as estimating this decomposition. * Importantly, the brackets denote an average over the _theoretical_ ensemble of sequencing the same batch. * As such, it will ultimately require an inference based upon a Bayesian prior. --- # How to estimate the decomposition? #### Formalize as a stochastic process * $\delta\_{i\alpha}$ is a $N_g \times N_c$ (large) ![:emph](random) matrix * Averages to ![:emph](zero) across runs * Captures ![:emph](only) the variability in the sequencing process * Will require a specific likelihood model to infer ??? * To this end, we formalize the data as a generative stochastic model with the following properties: -- #### Complication Sample variance depends on both the cell $\alpha$ and the gene $i$ ??? * Both gene expression and cell depth are themselves stochastic variables. * Reasonable to assume each element of matrix $\delta$ samples from a _different_ distribution. -- * Homoskedastic: all random variables have equal variance * ![:emph](Heteroskedastic): variance differs for each random variable ??? * Important jargon going forward -- .center[**We only have one realization of the stochastic process!**] ??? * This complicates the inference! Can't use independent elements as samples from the same ensemble. * Only have one realization to draw inferences from! * We will build this inference up piecewise. * Start from a homoskedastic matrix and then generalize to our true model of the data. --- # Why does heteroskedasticity make our life difficult? 
Prevents us from directly using random matrix theory to infer $\delta$. How to see? -- #### Thought experiment: * Assume each element of $\delta$ is sampled from a Gaussian with zero mean and variance $\sigma^2$ * Distribution of singular values $\lambda$ asymptotically follows the ![:emph](Marchenko-Pastur) distribution -- .left-col50[![:scale 500](/figs/devbio/seqspace/normalize/marchenko-pastur.png)] -- .right-col50[ * As size of matrix increases, approximation improves. * Has a maximum value: $$ \lambda\_{+} \equiv \sigma\left(\sqrt{N_g} + \sqrt{N_c}\right) $$ ] --- # Spike-in model Consider a rank 1 perturbation of a purely ![:emph](Gaussian) random matrix $$ n\_{i\alpha} = \gamma x\_{i} \bar{x}\_\alpha + \delta\_{i\alpha} $$ It has been shown <sup>1</sup> that: .footnote[<sup>1</sup>.cite[D. Féral The largest eigenvalue of rank one deformation of large Wigner matrices. (2006)]] -- * If $\gamma \le \lambda\_{+}$, the top singular value of $n$ converges to $\lambda\_+$ -- * If $\gamma > \lambda\_{+}$, the top singular value of $n$ converges to $\gamma + \lambda\_+/\gamma$ * If $\gamma \ge \lambda\_{+}$, the overlap of the top eigenvector of $n$ with $x$ converges to $1 - (\lambda\_+ / \gamma)^2$ ??? * Phase transition at $\gamma = \lambda\_{+}$ * Top principal component will now limit to the perturbation. -- #### Suggests simple algorithm to detect $\mu_{i\alpha}$ * Compute SVD of $n\_{i\alpha}$ * Keep statistically significant components: $\lambda$ larger than $\lambda_+$ --- count:false # Spike-in model #### Toy data .left-col50[![:scale 550](/figs/devbio/seqspace/normalize/gaussian_svd.png)] ??? * Simple to verify the validity of the algorithm empirically. * Worthwhile to walk through the plot as it will come up again. * On the vertical axis is the component rank, descending order (1st being the largest principal value). * On the horizontal axis is the singular value. * Both are log scale. 
* The yellow and green dashed lines are the known spectra of the decomposition. -- .right-col50[![:scale 550](/figs/devbio/seqspace/normalize/gaussian_overlap.png)] ??? * Great agreement between the predicted and known mean! * However, this is limited to homoskedastic matrices. --- # Generalize this idea to heteroskedastic matrices? It has been shown <sup>1, 2</sup> that the distribution of $\lambda$ of a heteroskedastic matrix converges to ![:emph](Marchenko-Pastur) if .footnote[<sup>1</sup>.cite[M. Idel et al. (2016)], <sup>2</sup>.cite[B. Landa Biwhitening Reveals the Rank of a Count Matrix (2021)]] * Average variance for each row is one * Average variance for each column is one ??? * Recent work has shown the eigenvalues of a (constrained) heteroskedastic matrix have similar properties. * Specifically, a doubly stochastic matrix, i.e. the row and column sums are 1. -- #### Suggests simple algorithm * Introduce cell $c\_\alpha$ and gene $g\_i$ scale factors, $\tilde{n}\_{i\alpha} \equiv g\_i n\_{i\alpha} c_\alpha$ ??? * Tildes denote a "rescaled" variant of each quantity. -- * Obtain by solving $$ \sum\_\alpha \langle\tilde{\delta}\_{i\alpha}^2\rangle = \sum\_\alpha g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_c \qquad \sum\_i\langle\tilde{\delta}\_{i\alpha}^2\rangle = \sum\_i g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_g $$ * Given model for $\langle \delta_{i\alpha}^2 \rangle$, above can be solved using ![:emph](Sinkhorn-Knopp). ??? * Sinkhorn-Knopp is an iterative algorithm that alternates between rescaling rows and columns to enforce the doubly stochastic constraint. -- * Compute SVD of $\tilde{n}_{i\alpha}$. 
* Keep components with $\lambda > \lambda\_+$ --- # Modelling overdispersed count data #### Negative binomial distribution for ![:emph](each gene) Utilize a generalized linear model to account for sequencing depth $n_\alpha$ variation $$ p(n\_{i\alpha}|\mu\_{i\alpha},\phi_i) = \frac{\Gamma(n+\phi)}{\Gamma(n+1)\Gamma(\phi)} \left(\frac{\mu}{\mu+\phi}\right)^n \left(\frac{\phi}{\mu+\phi}\right)^{\phi}$$ ??? * We write a negative binomial parameterized in terms of mean $\mu$ and overdispersion factor $\phi$. * Usually see this for discrete $\phi$ and probability of success $p$. -- The mean of the distribution is given by $$ \log\left(\mu\_{i\alpha}\right) \equiv A\_i + B\_i\log\left(n\_{\alpha}\right) $$ ??? * $n\_\alpha$ is the sequencing depth for cell $\alpha$. * $B\_i$ controls the power law of scaling. Strong prior to be 1 * $A\_i$ sets the scale for each gene. -- Fit $A_i$, $B_i$, and $\phi_i$ per gene by maximum likelihood, given the observed counts per gene. Take care to not overfit for lowly expressed genes! --- count:false # Modelling overdispersed count data #### Negative binomial distribution fits data .center[ ![:scale 700](/figs/devbio/seqspace/normalize/model_fit.png) ] --- # Modelling overdispersed count data #### Putting it all together: Normalization schema Unbiased estimator for the variance of negative binomial $$ \langle \delta^2\_{i\alpha} \rangle = \frac{\mu\_{i\alpha} + \phi\_i \mu\_{i\alpha}^2}{1 + \phi\_i} $$ -- The initial estimate for the mean is given by our estimated GLM model $$ \log\left(\mu\_{i\alpha}\right) = A\_i + B\_i\log\left(n\_{\alpha}\right) $$ -- Normalization factors estimated by (![:emph](Sinkhorn-Knopp)) $$ \sum\_\alpha g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_c \qquad \sum\_i g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_g $$ --- # Method accurately recapitulates toy data .left-col50[ #### Naive SVD doesn't see low rank ![:scale 525](/figs/devbio/seqspace/normalize/negbinom_naive.png) ] ??? 
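* A minimal NumPy sketch of the normalization schema from the previous slides, for presenter reference only. Assumptions that are not from the slides: the toy mean is rank one and the true $\mu$ and $\phi$ are used directly (the GLM fit is skipped); the variance is the textbook negative binomial variance $\mu + \mu^2/\phi$ for the pmf as parameterized earlier, swapped in for the estimator shown on the slide; all function and variable names are illustrative.

```python
import numpy as np

def nb_variance(mu, phi):
    # Textbook variance of a negative binomial with mean mu and size phi
    # (one phi per gene/row). Assumption: swapped in for the slide's estimator.
    return mu + mu**2 / phi[:, None]

def sinkhorn_knopp(var, n_iter=1000, tol=1e-9):
    # Solve for u_i = g_i^2 and v_a = c_a^2 such that the rescaled variance
    # matrix has row sums N_c and column sums N_g.
    n_g, n_c = var.shape
    u, v = np.ones(n_g), np.ones(n_c)
    for _ in range(n_iter):
        u = n_c / (var @ v)      # enforce the row constraint
        v = n_g / (var.T @ u)    # enforce the column constraint
        if np.abs(u * (var @ v) - n_c).max() < tol * n_c:
            break
    return np.sqrt(u), np.sqrt(v)

def rescaled_spectrum(counts, var):
    # Rescale counts by gene and cell factors, then compare the singular
    # values with the Marchenko-Pastur edge for unit average variance.
    n_g, n_c = counts.shape
    g, c = sinkhorn_knopp(var)
    scaled = g[:, None] * counts * c[None, :]
    sing = np.linalg.svd(scaled, compute_uv=False)
    lam_plus = np.sqrt(n_g) + np.sqrt(n_c)
    return sing, lam_plus, g, c

# Hypothetical toy data: rank-one mean (gene scale times cell depth).
rng = np.random.default_rng(0)
n_g, n_c = 200, 300
gene_scale = rng.gamma(2.0, 1.0, n_g) + 0.5
depth = rng.uniform(1.0, 10.0, n_c)
mu = np.outer(gene_scale, depth)
phi = rng.uniform(1.0, 5.0, n_g)
# NumPy parameterizes by (size, success probability); mean is then mu.
counts = rng.negative_binomial(phi[:, None], phi[:, None] / (phi[:, None] + mu))

sing, lam_plus, g, c = rescaled_spectrum(counts, nb_variance(mu, phi))
rank = int((sing > lam_plus).sum())
print("components above the MP edge:", rank)
```

* Singular values above `lam_plus` are the candidates for the low-rank mean; everything below is treated as noise.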
* No obvious knee. Maybe rank 1 or 2. * Overdispersed variance causes a huge overestimation of rank by MP. --- count:false # Method accurately recapitulates toy data .left-col50[ #### Rescaled SVD does ![:scale 525](/figs/devbio/seqspace/normalize/negbinom.png) ] ??? * Rescaling discovers the majority of the low rank mean. * Imperfect: Components below the MP cutoff are not detected. -- .left-col50[ #### Noisy reconstruction of mean ![:scale 525](/figs/devbio/seqspace/normalize/negbinom_mean.png) ] ??? * Mean can be inferred! Much noisier though. * Higher variance due to missing components _and_ propagated error from uncertainty in $\phi$. --- # Drosophila embryo expression fits well .left-col50[ ![:scale 520](/figs/devbio/seqspace/normalize/estimated_rank.png) ] ??? * Discover 45 statistically significant components * No sharp knee. Reasonable to assume we are missing degrees of freedom below the noise. -- .right-col50[ ![:scale 520](/figs/devbio/seqspace/participation_ratio.svg) ] ??? * Interestingly, these 45 components are _not_ localized on a few genes. * I show the participation ratio here, a measure of localization. * Can view it as roughly corresponding to the number of genes that actively contribute to each eigenvector. * Each component involves $250$ to $1000$ genes! Not $4$ gap genes. * Not as fully delocalized as the noise though. Some functional group structure detected. --- .center[ .vertical-center[ # Can we leverage existing databases to embed scRNAseq into space? ] ] --- # How to map normalized scRNAseq data to space? .center[Berkeley Drosophila Transcription Network Project] .center[ ![:scale 800](/figs/devbio/drosophila/bdtnp.jpg) ] .center[Aligned point clouds of $\sim 80$ genes] ??? * Over a decade ago, the Berkeley Drosophila Transcription Network Project set out to understand the complex network of transcriptional regulation required for development. * Used Drosophila melanogaster as a model system to explore formation of expression patterns. 
* Took coarse time-series cohort data. * Confocal microscopy of fluorescence data generated a point cloud of a handful of genes at a time. * Arduous task of aligning the various point clouds to a common reference embryo. * Result is a point cloud of 3D expression patterns. --- count: false # How to map normalized scRNAseq data to space? .center[ ![:scale 1000](/figs/devbio/seqspace/data_overview_pointcloud.svg) ] ??? * Recall the shape of our scRNAseq data is a normalized table of genes by cells. * The database gives us 3D point clouds of expression. * Here I show just one "channel", eve. --- count: false # How to map normalized scRNAseq data to space? .center[ ![:scale 1000](/figs/devbio/seqspace/data_overview.svg) ] ??? * Can view this just as another matrix. * Here positions, or virtual cells, are just different columns in the matrix. * Don't model space explicitly. Map to the points within the point cloud. * Conceptualize this as mapping one bag of cells to another, where one bag of cells happens to have spatial labels. --- count:false # How to map normalized scRNAseq data to space? .center[ ![:scale 1000](/figs/devbio/seqspace/data_mapping.svg) ] ??? * Formulate a probabilistic model, i.e. what's the probability each cell in our scRNAseq dataset was sampled from each position in our BDTNP dataset. * View this as finding a probability distribution for each column of our scRNAseq table over columns of our database table. -- #### Regularized optimal transport $$ E\left(\rho\_{i\alpha}\right) = \displaystyle\sum\limits\_{i\alpha} \rho\_{i\alpha} J\_{i\alpha} + T\displaystyle\sum\limits\_{i\alpha} \rho\_{i\alpha} \log\left(\rho\_{i\alpha}\right) + \text{marginal constraints}$$ ??? * Best framed in the optimal transport language: what's the minimal "cost" to assign each scRNAseq cell to a position on the embryo. * It is not reasonable to search for a bijection here! Instead want the probability over positions. 
* Regularize it by penalizing the entropy of the distribution. * Solution is found again by Sinkhorn-Knopp, once we compute the cost matrix $J$. --- # Recapitulate the in-situ database .left-col50[ ![:scale 505](/figs/basel/dev/correlation_vs_temperature_of_fit_gmm_continuous.png) ] ??? * Linear correlation of predicted mean expression pattern of mapped scRNAseq data to BDTNP database as a function of inverse temperature. * Non-monotonic: infinite temperature is the uniform solution, zero temperature is singular and susceptible to noise. * Error bars are standard deviation over database genes. * Capture $\sim70\%$ of the variation. -- .right-col50[ ![:scale 505](/figs/basel/morpho/optimal_transport_mapping_entropy.png) ] ??? * Higher temperature implies a less resolved map. * Each cell is mapped to $\sim 30$ possible positions. * Not as precise as the Princeton survey. --- # Drosophila expression at "single-cell" resolution .center[ ![:scale 342](/figs/devbio/seqspace/disco.png) ![:scale 342](/figs/devbio/seqspace/Kr.png) ![:scale 342](/figs/devbio/seqspace/twi.png) ![:scale 342](/figs/devbio/seqspace/eve.png) ] Shown above are (top) disco, kruppel, and (bottom) twist, eve --- # Collaboration to produce web interface .middle[ .center[ <video width="700" height="525" frameborder="0" controls autoplay> <source src="/figs/devbio/drosophila/webpage.mp4" type="video/mp4"> </video> ] ] --- .center[ .vertical-center[ # Can we infer space directly from the transcriptome? 
] ] --- # Density scales as if low-dimensional manifold .left-col50[ .center[ ![:scale 500](/figs/devbio/seqspace/radius_scaling_pointcloud.svg) ] ] --- count:false # Density scales as if low-dimensional manifold .left-col50[ .center[ ![:scale 500](/figs/devbio/seqspace/radius_scaling_ball.svg) ] ] -- .right-col50[ .center[ ![:scale 550](/figs/devbio/seqspace/euclidean_ball_scaling.svg) ] ] --- count:false # Density scales as if low-dimensional manifold .left-col50[ .center[ ![:scale 500](/figs/devbio/seqspace/radius_scaling_neighborhood.svg) ] ] --- count:false # Density scales as if low-dimensional manifold .left-col50[ .center[ ![:scale 500](/figs/devbio/seqspace/radius_scaling_all_neighborhood.svg) ] ] --- count:false # Density scales as if low-dimensional manifold .left-col50[ .center[ ![:scale 500](/figs/devbio/seqspace/radius_scaling_shortest_path.svg) ] ] -- .right-col50[ .center[ ![:scale 575](/figs/devbio/seqspace/geodesic_ball_scaling.svg) ] ] --- # Formulation of manifold inference #### Regression for homeomorphism of manifold .center[![:scale 900](/figs/devbio/seqspace/auto_encoder.svg)] ![:emph](Autoencoder): find an identity map that projects to low-dimensions $$ E(W, b) = \sum\_{i\alpha} \left(x\_{i\alpha} - y\_{i\alpha}\right)^2 = \sum\_{i\alpha} \left(x\_{i\alpha} - \phi^{-1}\left(z\_{i\alpha}\right) \right)^2 = \sum\_{i\alpha} \left(x\_{i\alpha} - \phi^{-1}\left(\phi\left(x\_{i\alpha}\right)\right) \right)^2 $$ --- # Difficult interpretation of vanilla autoencoder .left-col50[ .center[ Sparse clusters for MNIST. Not generative ![:scale 500](/figs/devbio/seqspace/mnist_ae.png) ] ] -- .left-col50[ .center[ Similar results for scRNAseq data ![:scale 550](/figs/devbio/seqspace/vanilla_ae_latent.png) ] ] -- .center[Need a way to ![:emph](regularize) the learning process] --- # Regularize by preserving topology of scRNAseq data .left-col50[ .center[ ![:scale 400](/figs/devbio/seqspace/radius_scaling_shortest_path.svg) ] Topology hard to parameterize. 
Settle for distances ] -- .right-col50[ #### Main idea: * Estimate pairwise geodesic distances $D_{\alpha\beta}$ between cells * Impose latent space "isometry" $$|z\_{\alpha} - z\_{\beta}| \sim D\_{\alpha\beta}$$ ] -- .right-col50[ .center[ ![:scale 400](/figs/devbio/seqspace/auto_encoder.svg) ] $$ E(W, b, \Lambda) \sim \sum\_{i\alpha} \left(x\_{i\alpha} - y\_{i\alpha}\right)^2 + \Lambda \sum\_{\alpha\beta} \left(D\_{\alpha\beta} - |\vec{z}\_\alpha - \vec{z}\_\beta| \right)^2$$ ] --- # Learns canonical manifolds .left-col50[ .center[ Swiss roll ![:scale 550](/figs/devbio/seqspace/swiss_roll.png) ] ] .right-col50[ .center[ Autoencoder latent space ![:scale 550](/figs/devbio/seqspace/swiss_roll_learned.png) ] ] --- # Learned manifold recapitulates scRNAseq data Use the described loss function and architecture on Drosophila scRNAseq .left-col50[![:scale 500](/figs/devbio/seqspace/scrna_loss.svg)] .right-col50[![:scale 500](/figs/devbio/seqspace/scrna_genes_vs_reconstructed.png)] -- .center[ Capture $\sim 80\%$ of the variance in the normalized scRNA counts ] --- # Learned manifold recapitulates space #### Estimated positions .left-col50[ .center[ ![:scale 500](/figs/devbio/embedding/AP_bdtnp.png) ] ] .right-col50[ .center[ ![:scale 500](/figs/devbio/embedding/DV_bdtnp.png) ] ] -- AP and DV positions fall along diagonals of 2D square. -- Investigation into the third dimension ongoing. 
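???

* For reference, a minimal NumPy sketch of the distance-regularized objective from the "Regularize by preserving topology" slide. It only evaluates the loss; no network is trained. Assumptions that are not from the slides: geodesics are approximated by shortest paths on a symmetric k-nearest-neighbour graph, the toy data is a circular arc rather than scRNAseq counts, and all names are illustrative.

```python
import numpy as np

def geodesic_distances(x, k=2):
    # x: (n_cells, n_features). Build a symmetric k-nearest-neighbour graph
    # and take all-pairs shortest paths (Floyd-Warshall) as geodesics.
    n = x.shape[0]
    euclid = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    graph = np.full((n, n), np.inf)
    nearest = np.argsort(euclid, axis=1)[:, :k + 1]   # self plus k neighbours
    rows = np.repeat(np.arange(n), k + 1)
    cols = nearest.ravel()
    graph[rows, cols] = euclid[rows, cols]
    graph = np.minimum(graph, graph.T)                # make edges undirected
    for m in range(n):
        graph = np.minimum(graph, graph[:, m:m + 1] + graph[m:m + 1, :])
    return graph

def regularized_loss(x, y, z, geo, lam):
    # Reconstruction error plus lam times the mismatch between latent
    # Euclidean distances and the estimated geodesic distances.
    recon = np.sum((x - y) ** 2)
    latent = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    return recon + lam * np.sum((geo - latent) ** 2)

# Toy 1D manifold: three quarters of a unit circle embedded in 2D.
theta = np.linspace(0.0, 1.5 * np.pi, 60)
x = np.stack([np.cos(theta), np.sin(theta)], axis=1)
geo = geodesic_distances(x, k=2)

# Perfect reconstruction (y = x) with two candidate latent codes:
loss_arc = regularized_loss(x, x, theta[:, None], geo, lam=1.0)   # arc length
loss_proj = regularized_loss(x, x, x[:, :1], geo, lam=1.0)        # projection
print(loss_arc, loss_proj)
```

* The arc-length code is nearly isometric to the geodesics, so its penalty is close to zero, while the naive projection folds the arc and is penalized heavily.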
--- # UMAP does not .left-col50[ .center[ AP axis ![:scale 500](/figs/devbio/seqspace/umap_drosophila_ap.png) ] ] .left-col50[ .center[ DV axis ![:scale 500](/figs/devbio/seqspace/umap_drosophila_dv.png) ] ] --- .center[ .vertical-center[ # Conceptual Outlook ] ] --- count:false # Many-body formulation of positional information .left-col50[ .center[ ![:scale 450](/figs/devbio/seqspace/spatial_vs_expression_geodesic.svg) ] ] --- count:false # Many-body formulation of positional information .left-col50[ .center[ Isometric embedding ![:scale 350](/figs/devbio/seqspace/butterfly_bdntp.png) ] ] -- .right-col50[ .center[ ![:scale 560](/figs/devbio/seqspace/expression_to_space_jacobian_cartoon.svg) ] ] -- .left[ #### Takeaways ] Previous attempt<sup>1</sup> assumed isometry between space and expression .footnote[ <sup>1</sup>.cite[Nitzan M. et al. Gene expression cartography Nature 2019] ] -- Jacobian ![:emph](captures) positional information of whole ![:emph](transcriptome)