.center[
.vertical-center[

# Computational techniques to analyze scRNAseq data of early development

Nicholas Noll

Kavli Institute for Theoretical Physics, Santa Barbara, USA

]
]

---

# Introduction

.left-col33[
.center[
Mechanics of development
]
]

--

.center-col33[
.center[
Pathogen evolution
]
]

--

.right-col33[
.center[
scRNAseq of development
]
]

---

# How is space encoded in the early transcriptome?
#### Focus on a specific case

.left-col50[
.center[
Early Drosophila development
]
]

--

.right-col50[
.center[
Morphogens provide positional information
]
]

---
count:false

# How is space encoded in the early transcriptome?
#### Focus on a specific case

.left-col50[
.center[
French flag model
]
]

.right-col50[
.center[
Morphogens provide positional information
]
]

---
count:false

# How is space encoded in the early transcriptome?
#### Focus on a specific case

.left-col50[
.center[
Transcriptome as spatial function?
]
]

.right-col50[
.center[
]
]

---

# Source of data

.center[
]

---

# scRNAseq data requires normalization
#### Numerous sources of technical noise

.left-col33[
.center[
Large variation across sequencing depth/cell
]
]

--

.center-col33[
.center[
Large variation across genes (expression + efficiency)
]
]

--

.right-col33[
.center[
Dropout effects
]
]

--

Biases are corrected by a process called normalization.

---

# Current normalization techniques
#### Active topic of research

.left-col50[
]

--

.right-col50[
3 main flavors

* Heuristic: log normalize
* External controls: spike-ins
* Statistical: Bayesian inference, coarse graining

We build upon statistical methods.
]

---

# Our general model of scRNAseq counts

Denote the measured expression of gene $i$ within cell $\alpha$ as $n_{i\alpha}$

#### Model as:

$$ n\_{i\alpha} = \mu\_{i\alpha} + \delta\_{i\alpha} $$

#### Our goal:

* Estimate the low-rank mean $\mu\_{i\alpha}$
* Estimate the measurement variance $\langle \delta\_{i\alpha}^2 \rangle$

---

# How to estimate the decomposition?
#### Formalize as a stochastic process

* $\delta\_{i\alpha}$ is a $N_g \times N_c$ (large) random matrix

--

* Averages to zero across runs

--

* Captures the variability in the sequencing process

--

* Will require a specific likelihood model to infer

#### Complication

Sample variance depends on both the cell $\alpha$ and the gene $i$

--

* Homoskedastic: all random variables have equal variance
* Heteroskedastic: variance differs for each random variable

--

.center[**We only have one realization of the stochastic process!**]

---

# Why does heteroskedasticity make our life difficult?

Prevents us from directly using random matrix theory to infer $\delta$. How to see?

--

#### Thought experiment:

* Assume each element of $\delta$ is sampled from a Gaussian with zero mean and variance $\sigma^2$
* The distribution of singular values $\lambda$ asymptotically follows the Marchenko-Pastur distribution

--

.left-col50[
]

--

.right-col50[
* As the size of the matrix increases, the approximation improves.
* Has a maximum value:
$$ \lambda\_{+} \equiv \sigma\left(\sqrt{N_g} + \sqrt{N_c}\right) $$
]

---

# Spike-in model

Consider a rank 1 perturbation of a purely Gaussian random matrix

$$ n\_{i\alpha} = \gamma x\_{i} \bar{x}\_\alpha + \delta\_{i\alpha} $$

It has been shown<sup>1</sup> that:

.footnote[<sup>1</sup>.cite[D. Féral, The largest eigenvalue of rank one deformation of large Wigner matrices (2006)]]

* If $\gamma \le \lambda\_{+}$, the top singular value of $n$ converges to $\lambda\_+$

--

* If $\gamma > \lambda\_{+}$, the top singular value of $n$ converges to $\gamma + \lambda\_+/\gamma$

--

* If $\gamma > \lambda\_{+}$, the overlap of the top singular vector of $n$ with $x$ converges to $1 - (\lambda\_+ / \gamma)^2$

--

#### Suggests a simple algorithm to detect $\mu_{i\alpha}$

* Compute the SVD of $n\_{i\alpha}$
* Keep statistically significant components: $\lambda$ larger than $\lambda_+$

---
count:false

# Spike-in model
#### Toy data

.left-col50[
]

--

.right-col50[
]

---

# Generalize this idea to heteroskedastic matrices?

It has been shown<sup>1, 2</sup> that the distribution of $\lambda$ of a heteroskedastic matrix converges to the Marchenko-Pastur distribution if

.footnote[<sup>1</sup>.cite[M. Idel et al. (2016)], <sup>2</sup>.cite[B. Landa, Biwhitening Reveals the Rank of a Count Matrix (2021)]]

--

* The average variance of each row is one

--

* The average variance of each column is one

--

#### Suggests a simple algorithm

* Introduce cell $c\_\alpha$ and gene $g\_i$ scale factors, $\tilde{n}\_{i\alpha} \equiv g\_i n\_{i\alpha} c_\alpha$

--

* Obtain them by solving
$$ \sum\_\alpha \langle\tilde{\delta}\_{i\alpha}^2\rangle = \sum\_\alpha g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 = N\_c \qquad \sum\_i\langle\tilde{\delta}\_{i\alpha}^2\rangle = \sum\_i g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 = N\_g $$
* Given a model for $\langle \delta_{i\alpha}^2 \rangle$, the above can be solved with the Sinkhorn-Knopp algorithm.

--

* Compute the SVD of $\tilde{n}_{i\alpha}$.
* Keep components with $\lambda > \lambda\_+$

---
count:false

# Generalize this idea to heteroskedastic matrices?
#### Poisson count data $\langle \delta^2\_{i\alpha} \rangle = n\_{i\alpha}$

.left-col50[
]

--

.right-col50[
]
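---
count:false

# Generalize this idea to heteroskedastic matrices?
#### Illustrative sketch

A minimal numpy sketch of the scheme above under the Poisson noise model $\langle \delta^2\_{i\alpha} \rangle = n\_{i\alpha}$; the function names, the alternating (Sinkhorn-style) update, and the fixed iteration count are illustrative choices, not the actual implementation.

```python
import numpy as np

def biwhiten(n, iters=50):
    """Scale a count matrix so the noise has unit average variance per row and column."""
    # assumes every gene and every cell has at least one count (no zero rows/columns)
    Ng, Nc = n.shape
    var = n.astype(float)              # Poisson model: <delta^2_{i,a}> = n_{i,a}
    g, c = np.ones(Ng), np.ones(Nc)    # gene and cell scale factors
    for _ in range(iters):             # alternating (Sinkhorn-style) updates
        g = np.sqrt(Nc / (var @ c**2))        # enforce sum_a g_i^2 <d^2> c_a^2 = Nc
        c = np.sqrt(Ng / (g**2 @ var))        # enforce sum_i g_i^2 <d^2> c_a^2 = Ng
    return g[:, None] * n * c[None, :]

def significant_rank(n_tilde):
    """Count singular values above the Marchenko-Pastur edge sigma*(sqrt(Ng)+sqrt(Nc)), with sigma = 1."""
    Ng, Nc = n_tilde.shape
    lam = np.linalg.svd(n_tilde, compute_uv=False)
    return int(np.sum(lam > np.sqrt(Ng) + np.sqrt(Nc)))
```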
---

# Modelling overdispersed count data

.left-col50[
#### Variance grows superlinearly
]

--

.left-col50[
#### Gene dropout
]

---
count:false

# Modelling overdispersed count data
#### Negative binomial distribution for each gene

Use a generalized linear model to account for variation in sequencing depth $n_\alpha$

$$ p(n\_{i\alpha}|\mu\_{i\alpha},\phi_i) = \frac{\Gamma(n+\phi)}{\Gamma(n+1)\Gamma(\phi)} \left(\frac{\mu}{\mu+\phi}\right)^n \left(\frac{\phi}{\mu+\phi}\right)^{\phi}$$

--

The mean of the distribution is given by

$$ \log\left(\mu\_{i\alpha}\right) \equiv A\_i + B\_i\log\left(n\_{\alpha}\right) $$

--

Fit $A_i$, $B_i$, and $\phi_i$ for each gene by maximum likelihood on its observed counts. Take care not to overfit lowly expressed genes!
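---
count:false

# Modelling overdispersed count data
#### Illustrative sketch

A minimal sketch of the per-gene maximum-likelihood fit described above; the function name, the optimizer, and the initial guess are illustrative choices, not the actual implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def fit_gene(counts, depth):
    """Fit A_i, B_i, phi_i of the negative-binomial GLM for one gene.

    counts: observed counts of the gene in each cell
    depth:  total counts (sequencing depth) n_alpha of each cell
    """
    log_depth = np.log(depth)

    def neg_log_likelihood(params):
        A, B, log_phi = params
        mu  = np.exp(A + B * log_depth)            # log(mu) = A + B*log(n_alpha)
        phi = np.exp(log_phi)                      # keep the dispersion positive
        ll  = (gammaln(counts + phi) - gammaln(counts + 1) - gammaln(phi)
               + counts * np.log(mu / (mu + phi))
               + phi * np.log(phi / (mu + phi)))
        return -np.sum(ll)

    x0  = np.array([np.log(np.mean(counts) + 1e-6), 1.0, 0.0])
    fit = minimize(neg_log_likelihood, x0, method="Nelder-Mead")
    A, B, log_phi = fit.x
    return A, B, np.exp(log_phi)
```

Regularizing or pooling $\phi_i$ across genes (not shown) is one way to guard against overfitting lowly expressed genes.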
---
count:false

# Modelling overdispersed count data
#### Negative binomial distribution fits data

.center[
]

---

# Modelling overdispersed count data
#### Normalization schema

Unbiased estimator for the variance of the negative binomial

$$ \langle \delta^2\_{i\alpha} \rangle = \frac{\mu\_{i\alpha} + \phi\_i \mu\_{i\alpha}^2}{1 + \phi\_i} $$

--

The initial estimate for the mean is given by the fitted GLM

$$ \log\left(\mu\_{i\alpha}\right) = A\_i + B\_i\log\left(n\_{\alpha}\right) $$

--

Normalization factors are estimated just as in the Poisson case

$$ \sum\_\alpha g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 = N\_c \qquad \sum\_i g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 = N\_g $$

---

# Method accurately recapitulates toy data

.left-col50[
#### Naive SVD doesn't see low rank
]

---
count:false

# Method accurately recapitulates toy data

.left-col50[
#### Rescaled SVD does
]

--

.left-col50[
#### Noisy reconstruction of mean
]

---

# Drosophila embryo expression fits well

.center[
]

---

# How to map normalized scRNAseq data to space?

.center[Berkeley Drosophila Transcription Network Project]

.center[
]

.center[Aligned point clouds of $\sim 80$ genes]

---
count:false

# How to map normalized scRNAseq data to space?

.center[
]

---
count:false

# How to map normalized scRNAseq data to space?

.center[
]

--

#### Regularized optimal transport

$$ E\left(\rho\_{i\alpha}\right) = \displaystyle\sum\limits\_{i\alpha} \rho\_{i\alpha} J\_{i\alpha} - T\displaystyle\sum\limits\_{i\alpha} \rho\_{i\alpha} \log\left(\rho\_{i\alpha}\right) $$

---

# Accurately reconstructs the database

.center[
]

---

# Predict the pattern of all remaining genes

.center[
]

Shown above are (top) disco, kruppel, and (bottom) twist, eve

---

# Acknowledgements

.center[
]