.center[
.vertical-center[

# Computational techniques to analyze scRNAseq data of early development

Nicholas Noll

Kavli Institute for Theoretical Physics, Santa Barbara, USA

]
]

---

# Introduction

.left-col33[
.center[
Mechanics of development
]
]

--

.center-col33[
.center[
Pathogen evolution
]
]

--

.right-col33[
.center[
scRNAseq of development
]
]

---

# How is space encoded in the early transcriptome?
#### Focus on a specific case

.left-col50[
.center[
Early Drosophila development
]
]

--

.right-col50[
.center[
Morphogens provide positional information
]
]

---
count:false

# How is space encoded in the early transcriptome?
#### Focus on a specific case

.left-col50[
.center[
French flag model
]
]

.right-col50[
.center[
Morphogens provide positional information
]
]

---
count:false

# How is space encoded in the early transcriptome?
#### Focus on a specific case

.left-col50[
.center[
Transcriptome as spatial function?
]
]

.right-col50[
.center[
]
]

---

# Source of data

.center[
]

---

# scRNAseq data requires normalization
#### Numerous sources of technical noise

.left-col33[
.center[
Large variation across sequencing depth/cell
]
]

--

.center-col33[
.center[
Large variation across genes (expression + efficiency)
]
]

--

.right-col33[
.center[
Dropout effects
]
]

--

Biases are corrected by a process called normalization.

---

# Current normalization techniques
#### Active topic of research

.left-col50[
]

--

.right-col50[
3 main flavors

* Heuristic: log normalize
* External controls: spike-ins
* Statistical: Bayesian inference, coarse graining

We build upon statistical methods.
]

---

# Our general model of scRNAseq counts

Denote the measured expression of gene $i$ within cell $\alpha$ as $n_{i\alpha}$

#### Model as:

$$ n\_{i\alpha} = \mu\_{i\alpha} + \delta\_{i\alpha} $$

#### Our goal:

* Estimate the low-rank mean $\mu\_{i\alpha}$
* Estimate the measurement variance $\langle \delta\_{i\alpha}^2 \rangle$

---

# How to estimate the decomposition?
#### Formalize as a stochastic process

* $\delta\_{i\alpha}$ is a $N_g \times N_c$ (large) random matrix

--

* Averages to zero across runs

--

* Captures the variability in the sequencing process

--

* Will require a specific likelihood model to infer

#### Complication

Sample variance depends on both the cell $\alpha$ and the gene $i$

--

* Homoskedastic: all random variables have equal variance
* Heteroskedastic: variance differs for each random variable

--

.center[**We only have one realization of the stochastic process!**]

---

# Why does heteroskedasticity make our life difficult?

Prevents us from directly using random matrix theory to infer $\delta$. How to see?

--

#### Thought experiment:

* Assume each element of $\delta$ is sampled from a Gaussian with zero mean and variance $\sigma^2$
* The distribution of singular values $\lambda$ asymptotically follows the Marchenko-Pastur distribution

--

.left-col50[
]

--

.right-col50[
* As the size of the matrix increases, the approximation improves.
* Has a maximum value:
$$ \lambda\_{+} \equiv \sigma\left(\sqrt{N_g} + \sqrt{N_c}\right) $$
]

---

# Spike-in model

Consider a rank 1 perturbation of a purely Gaussian random matrix

$$ n\_{i\alpha} = \gamma x\_{i} \bar{x}\_\alpha + \delta\_{i\alpha} $$

It has been shown<sup>1</sup> that:

.footnote[<sup>1</sup>.cite[D. Féral, The largest eigenvalue of rank one deformation of large Wigner matrices (2006)]]

* If $\gamma \le \lambda\_{+}$, the top singular value of $n$ converges to $\lambda\_+$

--

* If $\gamma > \lambda\_{+}$, the top singular value of $n$ converges to $\gamma + \lambda\_+/\gamma$

--

* If $\gamma > \lambda\_{+}$, the overlap of the top singular vector of $n$ with $x$ converges to $1 - (\lambda\_+ / \gamma)^2$

--

#### Suggests a simple algorithm to detect $\mu_{i\alpha}$

* Compute the SVD of $n\_{i\alpha}$
* Keep statistically significant components: $\lambda$ larger than $\lambda_+$

---
count:false

# Spike-in model
#### Toy data

.left-col50[
]

--

.right-col50[
]

---

# Generalize this idea to heteroskedastic matrices?

It has been shown<sup>1, 2</sup> that the distribution of $\lambda$ of a heteroskedastic matrix converges to the Marchenko-Pastur distribution if

.footnote[<sup>1</sup>.cite[M. Idel et al. (2016)], <sup>2</sup>.cite[B. Landa, Biwhitening Reveals the Rank of a Count Matrix (2021)]]

--

* The average variance of each row is one

--

* The average variance of each column is one

--

#### Suggests a simple algorithm

* Introduce cell $c\_\alpha$ and gene $g\_i$ scale factors, $\tilde{n}\_{i\alpha} \equiv g\_i n\_{i\alpha} c_\alpha$

--

* Obtain them by solving
$$ \sum\_\alpha \langle\tilde{\delta}\_{i\alpha}^2\rangle = \sum\_\alpha g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 = N\_c \qquad \sum\_i\langle\tilde{\delta}\_{i\alpha}^2\rangle = \sum\_i g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 = N\_g $$
* Given a model for $\langle \delta_{i\alpha}^2 \rangle$, the above can be solved with the Sinkhorn-Knopp algorithm.

--

* Compute the SVD of $\tilde{n}_{i\alpha}$.
* Keep components with $\lambda > \lambda\_+$

---
count:false

# Generalize this idea to heteroskedastic matrices?
#### Poisson count data $\langle \delta^2\_{i\alpha} \rangle = n\_{i\alpha}$

.left-col50[
]

--

.right-col50[
]
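---
count:false

# Generalize this idea to heteroskedastic matrices?
#### Illustrative sketch

A minimal numpy sketch of the scheme above under the Poisson noise model $\langle \delta^2\_{i\alpha} \rangle = n\_{i\alpha}$; the function names, the alternating (Sinkhorn-style) update, and the fixed iteration count are illustrative choices, not the actual implementation.

```python
import numpy as np

def biwhiten(n, iters=50):
    """Scale a count matrix so the noise has unit average variance per row and column."""
    # assumes every gene and every cell has at least one count (no zero rows/columns)
    Ng, Nc = n.shape
    var = n.astype(float)              # Poisson model: <delta^2_{i,a}> = n_{i,a}
    g, c = np.ones(Ng), np.ones(Nc)    # gene and cell scale factors
    for _ in range(iters):             # alternating (Sinkhorn-style) updates
        g = np.sqrt(Nc / (var @ c**2))        # enforce sum_a g_i^2 <d^2> c_a^2 = Nc
        c = np.sqrt(Ng / (g**2 @ var))        # enforce sum_i g_i^2 <d^2> c_a^2 = Ng
    return g[:, None] * n * c[None, :]

def significant_rank(n_tilde):
    """Count singular values above the Marchenko-Pastur edge sigma*(sqrt(Ng)+sqrt(Nc)), with sigma = 1."""
    Ng, Nc = n_tilde.shape
    lam = np.linalg.svd(n_tilde, compute_uv=False)
    return int(np.sum(lam > np.sqrt(Ng) + np.sqrt(Nc)))
```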
---

# Modelling overdispersed count data

.left-col50[
#### Variance grows superlinearly
]

--

.left-col50[
#### Gene dropout
]

---
count:false

# Modelling overdispersed count data
#### Negative binomial distribution for each gene

Use a generalized linear model to account for variation in sequencing depth $n_\alpha$

$$ p(n\_{i\alpha}|\mu\_{i\alpha},\phi_i) = \frac{\Gamma(n+\phi)}{\Gamma(n+1)\Gamma(\phi)} \left(\frac{\mu}{\mu+\phi}\right)^n \left(\frac{\phi}{\mu+\phi}\right)^{\phi}$$

--

The mean of the distribution is given by

$$ \log\left(\mu\_{i\alpha}\right) \equiv A\_i + B\_i\log\left(n\_{\alpha}\right) $$

--

Fit $A_i$, $B_i$, and $\phi_i$ for each gene by maximum likelihood on its observed counts. Take care not to overfit lowly expressed genes!
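---
count:false

# Modelling overdispersed count data
#### Illustrative sketch

A minimal sketch of the per-gene maximum-likelihood fit described above; the function name, the optimizer, and the initial guess are illustrative choices, not the actual implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def fit_gene(counts, depth):
    """Fit A_i, B_i, phi_i of the negative-binomial GLM for one gene.

    counts: observed counts of the gene in each cell
    depth:  total counts (sequencing depth) n_alpha of each cell
    """
    log_depth = np.log(depth)

    def neg_log_likelihood(params):
        A, B, log_phi = params
        mu  = np.exp(A + B * log_depth)            # log(mu) = A + B*log(n_alpha)
        phi = np.exp(log_phi)                      # keep the dispersion positive
        ll  = (gammaln(counts + phi) - gammaln(counts + 1) - gammaln(phi)
               + counts * np.log(mu / (mu + phi))
               + phi * np.log(phi / (mu + phi)))
        return -np.sum(ll)

    x0  = np.array([np.log(np.mean(counts) + 1e-6), 1.0, 0.0])
    fit = minimize(neg_log_likelihood, x0, method="Nelder-Mead")
    A, B, log_phi = fit.x
    return A, B, np.exp(log_phi)
```

Regularizing or pooling $\phi_i$ across genes (not shown) is one way to guard against overfitting lowly expressed genes.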
---
count:false

# Modelling overdispersed count data
#### Negative binomial distribution fits data

.center[
]

---

# Modelling overdispersed count data
#### Normalization schema

Unbiased estimator for the variance of the negative binomial

$$ \langle \delta^2\_{i\alpha} \rangle = \frac{\mu\_{i\alpha} + \phi\_i \mu\_{i\alpha}^2}{1 + \phi\_i} $$

--

The initial estimate for the mean is given by the fitted GLM

$$ \log\left(\mu\_{i\alpha}\right) = A\_i + B\_i\log\left(n\_{\alpha}\right) $$

--

Normalization factors are estimated just as in the Poisson case

$$ \sum\_\alpha g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 = N\_c \qquad \sum\_i g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 = N\_g $$

---

# Method accurately recapitulates toy data

.left-col50[
#### Naive SVD doesn't see low rank
]

---
count:false

# Method accurately recapitulates toy data

.left-col50[
#### Rescaled SVD does
]

--

.left-col50[
#### Noisy reconstruction of mean
]

---

# Drosophila embryo expression fits well

.center[
]

---

# How to map normalized scRNAseq data to space?

.center[Berkeley Drosophila Transcription Network Project]

.center[
]

.center[Aligned point clouds of $\sim 80$ genes]

---
count:false

# How to map normalized scRNAseq data to space?

.center[
]

---
count:false

# How to map normalized scRNAseq data to space?

.center[
]

--

#### Regularized optimal transport

$$ E\left(\rho\_{i\alpha}\right) = \displaystyle\sum\limits\_{i\alpha} \rho\_{i\alpha} J\_{i\alpha} - T\displaystyle\sum\limits\_{i\alpha} \rho\_{i\alpha} \log\left(\rho\_{i\alpha}\right) $$

---

# Accurately reconstructs the database

.center[
]

---

# Predict the pattern of all remaining genes

.center[
]

Shown above are (top) disco, kruppel, and (bottom) twist, eve

---

# Acknowledgements

.center[
]