.center[ .vertical-center[ # Computational techniques to analyze scRNAseq data of early development Nicholas Noll Kavli Institute for Theoretical Physics, Santa Barbara, USA ] ] --- # Introduction .center-col33[ .center[ ![:scale 330](/figs/devbio/morpho/inference_embryo.svg) Mechanics of development ] ] -- .center-col33[ .center[ ![:scale 330](/figs/basel/timetree.png) Pathogen evolution ] ] -- .right-col33[ .center[ ![:scale 190](/figs/devbio/seqspace/encoder-rotated.svg) scRNAseq of development ] ] --- # How is space encoded in the early transcriptome? #### Focus on a specific case .left-col50[ .center[ Early drosophila development ![:scale 440](/figs/devbio/drosophila/embryo.png) ] ] -- .right-col50[ .center[ Morphogens provide positional information ![:scale 410](/figs/devbio/drosophila/bicoid.png) ] ] --- count:false # How is space encoded in the early transcriptome? #### Focus on a specific case .left-col50[ .center[ French flag model ![:scale 410](/figs/devbio/drosophila/frenchflag.png) ] ] .right-col50[ .center[ Morphogens provide positional information ![:scale 410](/figs/devbio/drosophila/bicoid.png) ] ] --- count:false # How is space encoded in the early transcriptome? #### Focus on a specific case .left-col50[ .center[ Transcriptome as spatial function? ![:scale 440](/figs/devbio/drosophila/embryo.png) ] ] .right-col50[ .center[ ![:scale 400](/figs/devbio/drosophila/scrnaseq.png) ] ] --- # Source of data .center[ ![:scale 650](/figs/devbio/seqspace/drosophila_single_cell_resolution_header.png) ] --- # scRNAseq data requires normalization #### Numerous sources of technical noise .left-col33[ .center[ ![:scale 330](/figs/devbio/seqspace/normalize/depth.png) Large variation across sequencing depth/cell ] ] -- .center-col33[ .center[ ![:scale 330](/figs/devbio/seqspace/normalize/gene_expression.png) Large variation across genes (expression + efficiency) ] ] -- .right-col33[ .center[ ![:scale 330](/figs/devbio/seqspace/normalize/gene_dropout.png) Dropout effects ] ] -- Biases are ![:emph](rectified) by a process called ![:emph](normalization). --- # Current normalization techniques #### Active topic of research .left-col50[ ![:scale 500](/figs/devbio/seqspace/normalize/method.png) ] -- .right-col50[ 3 main flavors * Heurestic: log normalize * External controls: spike-ins * Statistical: Bayesian inference, coarse graining We build upon ![:emph](previous likelihood) methods. ] --- # Our general model of scRNAseq counts Denote measured expression of gene $i$ within cell $\alpha$ as $n_{\alpha i}$ #### Model as: .center[![:scale 600](/figs/devbio/seqspace/normalize/matrix_cartoon.png)] $$ n\_{i\alpha} = \mu\_{i\alpha} + \delta\_{i\alpha} $$ #### Our goal: * Estimate the low-rank mean $\mu\_{i\alpha}$ * Estimate the measurement variance $\langle \delta\_{i\alpha}^2 \rangle$ --- # How to estimate the decomposition? #### Formalize as a stochastic process * $\delta\_{i\alpha}$ is a $N_g \times N_c$ (large) ![:emph](random) matrix -- * Averages to ![:emph](zero) across runs -- * Captures ![:emph](only) the variability in the sequencing process -- * Will require a specific likelihood model to infer #### Complication Sample variance depends on both the cell $\alpha$ and the gene $i$ -- * Homoskedastic: all random variables have equal variance * ![:emph](Heteroskedastic): variance differs for each random variable -- .center[**We only have one realization of the stochastic process!**] --- # Why does heteroskedasticity make our life difficult? Prevents us from directly using random matrix theory to infer $\delta$. How to see? -- #### Thought experiment: * Assume each element of $\delta$ is sampled from a Gaussian with zero mean and variance $\sigma^2$ * Distribution of singular values $\lambda$ asymptotically follow ![:emph](Marchenko-Pastur) distribution -- .left-col50[![:scale 500](/figs/devbio/seqspace/normalize/marchenko-pastur.png)] -- .right-col50[ * As size of matrix increases, approximation improves. * Has a maximum value: $$ \lambda\_{+} \equiv \sigma\left(\sqrt{N_g} + \sqrt{N_c}\right) $$ ] --- # Spike-in model Consider a rank 1 perturbation of a purely Gaussian random matrix $$ n\_{i\alpha} = \gamma x\_{i} \bar{x}\_\alpha + \delta\_{i\alpha} $$ It has been shown <sup>1</sup> that: .footnote[<sup>1</sup>.cite[D FĂ©ral The largest eigenvalue of rank one deformation of large Wigner matrices. (2006)]] * If $\gamma \le \lambda\_{+}$, the top singular value of $n$ converges to $\lambda\_+$ -- * If $\gamma > \lambda\_{+}$, the top singular value of $n$ converges to $\gamma + \lambda\_+/\gamma$ -- * If $\gamma \ge \lambda\_{+}$, the overlap of the top eigenvector of $n$ with $x$ converges to $1 - (\lambda\_+ / \gamma)^2$ -- #### Suggests simple algorithm to detect $\mu_{i\alpha}$ * Compute SVD of $n\_{i\alpha}$ * Keep statistically significant components: $\lambda$ larger than $\lambda_+$ --- count:false # Spike-in model #### Toy data .left-col50[![:scale 550](/figs/devbio/seqspace/normalize/gaussian_svd.png)] -- .right-col50[![:scale 550](/figs/devbio/seqspace/normalize/gaussian_overlap.png)] --- # Generalize this idea to heteroskedastic matrices? Shown <sup>1, 2</sup> the distribution of $\lambda$ of heteroskedastic matrix converges to ![:emph](Marchenko-Pastur) if .footnote[<sup>1</sup>.cite[M. Idel. et al (2016), <sup>2</sup>.cite[B. Landa Biwhitening Reveals the Rank of a Count Matrix (2021)]]] -- * Average variance for each row is one -- * Average variance for each column is one -- #### Suggests simple algorithm * Introduce cell $c\_\alpha$ and gene $g\_i$ scale factors, $\tilde{n}\_{i\alpha} \equiv g\_i n\_{i\alpha} c_\alpha$ -- * Obtain by solving $$ \sum\_\alpha \langle\tilde{\delta}\_{i\alpha}^2\rangle = \sum\_\alpha g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_c \qquad \sum\_i\langle\tilde{\delta}\_{i\alpha}^2\rangle = \sum\_i g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_g $$ * Given model for $\langle \delta_{i\alpha}^2 \rangle$, above can be solved using ![:emph](Sinkhorn-Knopp). -- * Compute SVD of $\tilde{n}_{i\alpha}$. * Keep components with $\lambda > \lambda\_+$ --- count:false # Generalize this idea to heteroskedastic matrices? #### Poisson count data $\langle \delta^2\_{i\alpha} \rangle = n\_{i\alpha}$ .left-col50[![:scale 550](/figs/devbio/seqspace/normalize/poisson_svd.png)] -- .right-col50[![:scale 550](/figs/devbio/seqspace/normalize/poisson_overlap.png)] --- # Modelling overdispersed count data .left-col50[ #### Variance grows superlinearly ![:scale 525](/figs/devbio/seqspace/normalize/overdispersed.png) ] -- .left-col50[ #### Gene dropout ![:scale 500](/figs/devbio/seqspace/normalize/sparsity.png) ] --- count:false # Modelling overdispersed count data #### Negative binomial distribution for each gene Utilize a generalized linear model to account for sequencing depth $n_\alpha$ variation $$ p(n\_{i\alpha}|\mu\_{i\alpha},\phi_i) = \frac{\Gamma(n+\phi)}{\Gamma(n+1)\Gamma(\phi)} \left(\frac{\mu}{\mu+\phi}\right)^n \left(\frac{\phi}{\mu+\phi}\right)^{\phi}$$ -- The mean of the distribution is given by $$ \log\left(\mu\_{i\alpha}\right) \equiv A\_i + B\_i\log\left(n\_{\alpha}\right) $$ -- Fit $A_i$, $B_i$, and $\phi_i$ per gene by maximum likelihood, given the observed counts per gene. Take care to not overfit for lowly expressed genes! --- count:false # Modelling overdispersed count data #### Negative binomial distribution fits data .center[ ![:scale 700](/figs/devbio/seqspace/normalize/model_fit.png) ] --- # Modelling overdispersed count data #### Normalization schema Unbiased estimator for the variance of negative binomial $$ \langle \delta^2\_{i\alpha} \rangle = \frac{\mu\_{i\alpha} + \phi\_i \mu\_{i\alpha}^2}{1 + \phi\_i} $$ -- The initial estimate for the mean is given by our estimated GLM model $$ \log\left(\mu\_{i\alpha}\right) = A\_i + B\_i\log\left(n\_{\alpha}\right) $$ -- Normalization factors estimated same as Poisson (![:emph](Sinkhorn-Knopp)) $$ \sum\_\alpha g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_c \qquad \sum\_i g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_g $$ --- # Method accurately recapitulates toy data .left-col50[ #### Naive SVD doesn't see low rank ![:scale 525](/figs/devbio/seqspace/normalize/negbinom_naive.png) ] --- count:false # Method accurately recapitulates toy data .left-col50[ #### Rescaled SVD does ![:scale 525](/figs/devbio/seqspace/normalize/negbinom.png) ] -- .left-col50[ #### Noisy reconstruction of mean ![:scale 525](/figs/devbio/seqspace/normalize/negbinom_mean.png) ] --- # Drosophila embryo expression fits well .center[ ![:scale 700](/figs/devbio/seqspace/normalize/estimated_rank.png) ] --- # How to map normalized scRNAseq data to space? .center[Berkeley Drosophila Transcription Network Project] .center[ ![:scale 800](/figs/devbio/drosophila/bdtnp.jpg) ] .center[Aligned point clouds of $\sim 80$ genes] --- count: false # How to map normalized scRNAseq data to space? .center[ ![:scale 1000](/figs/devbio/seqspace/data_overview.svg) ] --- count:false # How to map normalized scRNAseq data to space? .center[ ![:scale 1000](/figs/devbio/seqspace/data_mapping.svg) ] -- #### Regularized optimal transport $$ E\left(\rho\_{i\alpha}\right) = \displaystyle\sum\limits\_{i\alpha} \rho\_{i\alpha} J\_{i\alpha} - T\displaystyle\sum\limits\_{i\alpha} \rho\_{i\alpha} \log\left(\rho\_{i\alpha}\right) $$ --- # Accurately reconstructs the database .center[ ![:scale 525](/figs/basel/dev/correlation_vs_temperature_of_fit_gmm_continuous.png) ] --- # Predict the pattern of all remaining genes .center[ ![:scale 342](/figs/devbio/seqspace/disco.png) ![:scale 342](/figs/devbio/seqspace/Kr.png) ![:scale 342](/figs/devbio/seqspace/twi.png) ![:scale 342](/figs/devbio/seqspace/eve.png) ] Shown above are (top) disco, kruppel, and (bottom) twist, eve --- # Acknowledgements .center[ ![:scale 700](/figs/devbio/morpho/acknowledgements.svg) ]