presentation

.center[
.vertical-center[
# Learning the geometry of expression in early Drosophila embryogenesis

Nicholas Noll

Kavli Institute for Theoretical Physics, Santa Barbara, USA
]
]

???
* Hello everyone!
* As the slide suggests, my name is Nicholas Noll, currently a postdoctoral fellow at the KITP in sunny Santa Barbara.
* I am happy and excited to be here to tell you a bit about my recent, unpublished work looking into how positional information is encoded into the transcriptome of early development.
* I want to emphasize that questions and interruptions are welcome and strongly encouraged!

---

# Introduction

.center-col33[
.center[
![:scale 330](/figs/devbio/morpho/inference_embryo.svg)
Mechanics of development
]
]

???
* I thought it would be nice just to give you a brief overview of my past work before diving into the most recent project.
* I started my journey into Biology during my PhD work which primarily centered on understanding/inferring the forces at play during early morphogenesis.
* As it's hard to measure forces experimentally, I formulated a model and subsequently an inference algorithm to infer the stress tensor based upon the observed cellular geometry.

.center-col33[
.center[
![:scale 330](/figs/basel/timetree.png)
Pathogen evolution
]
]

???
* My first postdoc was a departure from developmental biology.
* I joined the group of Richard Neher to study microbial evolution.
* Launched a sequencing study in collaboration with the local clinicians to de novo assemble antibiotic resistant bacterial genomes.
* Wanted to understand/infer patterns of HGT in the context of ARG evolution.
* In the process, wrote an algorithm to align whole genomes into a graph-based data structure.

.right-col33[
.center[
![:scale 190](/figs/devbio/seqspace/encoder-rotated.svg)

scRNAseq of development
]
]

???
* Finally, here at the KITP I have embarked on a project that has combined my experience with NGS and developmental biology.
* This is the story I plan to speak on today.

---
# Acknowledgements
.center[
![:scale 700](/figs/devbio/seqspace/acknowledgements.svg)
]

???
* I want to start by acknowledging my collaborators for this project.
* They have been incredibly helpful, both as a sounding board for ideas and in proposing new experiments to test them.

---

# How do cells know where they are in space?

#### Focus on Drosophila embryogenesis

.left-col50[
.center[
Early Drosophila development
![:scale 440](/figs/devbio/drosophila/embryo.png)
]
]

???
* A central question in developmental biology considers the nature of how cells can "know" where they are in space.
* More pointedly, let's focus on the model organism Drosophila melanogaster (fruit fly), shown here at ~2 hours PF.
* Roughly 6000 cells within a simple epithelial monolayer that covers the surface of the yolk.
* The embryo was caught just at the onset of gastrulation: can be seen by the initial formation of the cephalic furrow and posterior midgut.
* Belly (ventral) on the bottom, back (dorsal) at top. Head (anterior) to the left, tail (posterior) to the right.
* Reframe the general question specifically: how does the embryo break head/tail symmetry?

.right-col50[
.center[
Morphogens provide positional information
![:scale 410](/figs/devbio/drosophila/bicoid.png)
]
]

???
* Answer: the mother breaks it!
* Nobel Prize-winning work from Eric Wieschaus and Christiane Nüsslein-Volhard identified maternal mRNAs that are deposited at the head (bicoid) / tail (nanos) pole of each egg.
* These mRNAs diffuse and degrade (are used to build proteins by the nuclei in the blastula), setting up an "exponential"/monotonic profile in space.
* In green I am showing Bicoid, the first identified developmental morphogen. As named, if knocked out in the mother, will result in an embryo that forms with two tails.
* Strongly binary phenotype.
* Cells can "measure" distance from head pole by measuring concentration.

---
count:false

# How do cells know where they are in space?

#### Focus on Drosophila embryogenesis

.left-col50[
.center[
French flag model
![:scale 440](/figs/devbio/drosophila/frenchflag.png)
]
]

.right-col50[
.center[
Morphogens provide positional information
![:scale 410](/figs/devbio/drosophila/bicoid.png)
]
]

???
* Bicoid itself is a transcription factor that activates many downstream targets.
* "Measurement" is often conceptualized in a breakpoint model, i.e. the French flag of development.
* Simple algorithm for fate specification: if concentration is greater than a cutoff, adopt fate A, else fate B.
* Here, in blue, I show hunchback, which plausibly follows the paradigm.
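* A minimal sketch of this cutoff rule, assuming a normalized exponential profile. The names `french_flag_fate`, `decay_length` and `cutoff`, and their values, are hypothetical and chosen only for illustration:

```python
import numpy as np

def french_flag_fate(x, decay_length=0.2, cutoff=0.3):
    """Assign fate 'A' where an exponential morphogen profile exceeds a cutoff.

    x is the fractional AP position (0 = head pole). decay_length and cutoff
    are illustrative values, not measured quantities.
    """
    concentration = np.exp(-x / decay_length)  # normalized to 1 at the head pole
    return np.where(concentration > cutoff, "A", "B")

positions = np.linspace(0.0, 1.0, 11)
fates = french_flag_fate(positions)

# Position at which the concentration equals the cutoff (the fate boundary).
boundary = -0.2 * np.log(0.3)
```

* The boundary position depends only on the ratio of cutoff to source concentration, which is why a single threshold yields a single sharp stripe edge.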

---
count:false

# How do cells know where they are in space?

#### Focus on Drosophila embryogenesis

.left-col50[
.center[
French flag model
![:scale 440](/figs/devbio/drosophila/frenchflag.png)
]
]

.right-col50[
.center[
Positional information informs fate
![:scale 495](/figs/devbio/drosophila/ap_signaling_cascade.png)
]
]

???
* The actual world is more complicated!
* The result of Eric and Christiane's work was the elucidation of the entire AP patterning system, shown here.
* Conceptualized as a hierarchical transduction network: bcd -> gap genes -> pair rule genes -> segment polarity
* Broad range positional information is refined into fine, reproducible patterns of cell fate.
* Segments directly correspond to future larval/pupal body segments.

---
count:false

# How do cells know where they are in space?

#### Theory of Positional information

.left-col50[
.center[
L. Wolpert. Positional Information and Pattern Formation in Development
![:scale 500](/figs/devbio/seqspace/wolpert_positional_information.png)
]
]

???
* The progenitor of the "discrete" framework of positional information is Lewis Wolpert.
* All of morphogenesis and pattern formation can be conceptualized as a system that intercalates positional values to be a smooth function.
* Proposed cells acquire positional information with respect to discrete boundaries, rather than a true position.
* Succinctly: Position is an integer, not a real number.

.right-col50[
.center[
![:scale 500](/figs/devbio/seqspace/bialek_positional_info_header.png)
![:scale 450](/figs/devbio/seqspace/bialek_positional_info_fig.png)
]
* Positional information is constant 
* Determined by gap genes
]

???
* An updated, quantitative take on the same system was published a decade ago from Bialek et al.
* Immunofluorescence antibody staining of the 4 gap genes taken simultaneously.
* Focused on expression along the AP axis of the midsagittal plane.
* Measured across ensemble of embryos, compute the mutual information between the 4 gap genes and AP position.
* Claim positional information is constant and specifies each cell's identity up to a cell size.
* Real number, not integer!

---

# How is space encoded in the early transcriptome?

#### Focus on Drosophila embryogenesis

.left-col50[
.center[
Transcriptome as spatial function?
![:scale 440](/figs/devbio/drosophila/embryo.png)
]
]

.right-col50[
.center[
![:scale 400](/figs/devbio/seqspace/pointcloud.svg)
]
]

???
* I want to reformulate this question in the age of scRNAseq.
* The thought experiment I wish to perform is to go back to the original Drosophila embryo at the onset of gastrulation.
* Imagine we could measure the gene expression for each cell in the epithelial monolayer.
* Collect into a point cloud.

---
count:false

# How is space encoded in the early transcriptome?

#### Focus on Drosophila embryogenesis

.left-col50[
.center[
Transcriptome as spatial function?
![:scale 440](/figs/devbio/drosophila/embryo.png)
]
]

.right-col50[
.center[
![:scale 400](/figs/devbio/seqspace/pointcloud_surface.svg)
]
]

???
* If there is a true "mean" gene expression that is a function of spatial coordinates, then the point cloud expression data can be best thought of as sampled from a low-dimensional manifold.
* When viewed this way, it almost becomes a canonical dimensionality reduction problem.

---
count:false

# How is space encoded in the early transcriptome?

#### Focus on Drosophila embryogenesis

.left-col50[
.center[
Continuum transcriptome ansatz
![:scale 440](/figs/devbio/drosophila/embryo_AP.png)
]
]

.right-col50[
.center[
![:scale 400](/figs/devbio/seqspace/pointcloud_surface_coordinates.svg)
]
]

???
* Moreover, we make a strong ansatz: the primary directions of this manifold _are_ space!
* Thus, if this is indeed true, we should be able to discover the AP axis de novo from the distribution of the expression point cloud.
* In other words, we should be able to "infer" space from this surface.

---
count:false

# How is space encoded in the early transcriptome?

#### Focus on Drosophila embryogenesis

.left-col50[
.center[
Pose as inference problem
![:scale 440](/figs/devbio/drosophila/embryo.png)
]
]

.right-col50[
.center[
![:scale 400](/figs/devbio/drosophila/scrnaseq.png)
]
]

???
* This is indeed how we pose the problem: given a bag of cells with sequenced transcriptomes obtained from scRNAseq, can we infer where on the original embryo they were sampled from?

---

# Source of data

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;

.center[
![:scale 650](/figs/devbio/seqspace/drosophila_single_cell_resolution_header.png)
]

???
* All data discussed today was obtained from public NCBI GEO repositories.
* Generated from the 2017 drop-seq study shown here.

---

# Three components to our approach

.left-col33[
.center[
**scRNAseq normalization**
![:scale 200](/figs/devbio/seqspace/normalize/method_splice.png)
]
]

???
* Time permitting, I hope to discuss the three main pillars of this study with you today.
* The first component is a novel normalization scheme based upon random matrix theory.
* Linear transformation of the data that stabilizes the variance, unlike the problematic log normalization.

.center-col33[
.center[
**Supervised inference**
![:scale 350](/figs/basel/dev/eve.png)
*Predicted Eve expression*
]
]

???
* Another component will outline how to match scRNAseq data to a known database.
* I label this section as a "supervised" inference.
* Generate a set of "known" labels to subsequently check our learned manifold against.

.right-col33[
.center[
**Unsupervised inference**
![:scale 190](/figs/devbio/seqspace/encoder-rotated.svg)
]
]

???
* Lastly, I will describe our novel approach to manifold learning in the context of scRNAseq data.
* Autoencoder architecture regularized by constrained pairwise distance.
* With that said, let's dive in!

---

# scRNAseq data requires normalization

#### Numerous sources of technical noise
.left-col33[
.center[
![:scale 330](/figs/devbio/seqspace/normalize/depth.png)
Large variation across sequencing depth/cell
]
]

???
* scRNAseq data critically requires preprocessing steps to enable downstream analysis with minimal technical artifacts.
* These preprocessing steps are an attempt to "correct" for technical noise in the sequencing process.
* Show up as features in the data.
* Heterogeneous sequencing depth owing to variable reaction efficiencies.
* Spans 2 decades

.center-col33[
.center[
![:scale 330](/figs/devbio/seqspace/normalize/gene_expression.png)
Large variation across genes (expression + efficiency)
]
]

???
* Heterogeneous expression count across genes (real and technical).
* Spans 4 decades

.right-col33[
.center[
![:scale 330](/figs/devbio/seqspace/normalize/gene_dropout.png)
Dropout effects
]
]

???
* Overdispersed!
* Shows up as a nonlinear mean vs variance relationship and an overinflation of zero counts relative to Poisson expectation.
* Coined "dropout".

Biases are ![:emph](rectified) by a process called ![:emph](normalization).

We build upon ![:emph](previous likelihood) methods.

---

# Our general model of scRNAseq counts

Denote measured expression of gene $i$ within cell $\alpha$ as $n_{i\alpha}$

#### Model as:
.center[![:scale 600](/figs/devbio/seqspace/normalize/matrix_cartoon.png)]
$$ n\_{i\alpha} = \mu\_{i\alpha} + \delta\_{i\alpha} $$

???
* Roman letters index genes (rows). Greek index cells (columns).
* Critically we assume that the true count matrix is "low" rank compared to the size of the full count matrix.
* Can view this as a linearized version of our positional manifold ansatz.
* The full rank of the measured count matrix then comes from the technical noise $\delta$.

#### Our goal:
  * Estimate the low-rank mean $\mu\_{i\alpha}$
  * Estimate the measurement variance $\langle \delta\_{i\alpha}^2 \rangle$

???
* Normalization can be viewed as estimating this decomposition.
* Importantly, the brackets denote an average over the _theoretical_ ensemble of sequencing the same batch.
* As such, it will ultimately require an inference based upon a Bayesian prior.

---

# How to estimate the decomposition?

#### Formalize as a stochastic process
  * $\delta\_{i\alpha}$ is a $N_g \times N_c$ (large) ![:emph](random) matrix
  * Averages to ![:emph](zero) across runs
  * Captures ![:emph](only) the variability in the sequencing process
  * Will require a specific likelihood model to infer

???
* To this end, we formalize the data as a generative stochastic model with the following properties:

#### Complication
Sample variance depends on both the cell $\alpha$ and the gene $i$

???
* Both gene expression and cell depth are themselves stochastic variables.
* Reasonable to assume each element of matrix $\delta$ samples from a _different_ distribution.

--
  * Homoskedastic: all random variables have equal variance
  * ![:emph](Heteroskedastic): variance differs for each random variable

???
* Important jargon going forward

.center[**We only have one realization of the stochastic process!**]

???
* This complicates the inference! Can't use independent elements as samples from the same ensemble.
* Only have one realization to draw inferences from!
* We will build this inference up piecewise.
* Start from a homoskedastic matrix and then generalize to our true model of the data.

---
# Why does heteroskedasticity make our life difficult?
Prevents us from directly using random matrix theory to infer $\delta$. How to see?
--

#### Thought experiment:
  * Assume each element of $\delta$ is sampled from a Gaussian with zero mean and variance $\sigma^2$
  * Distribution of singular values $\lambda$ asymptotically follows the ![:emph](Marchenko-Pastur) distribution
--

.left-col50[![:scale 500](/figs/devbio/seqspace/normalize/marchenko-pastur.png)]

--
.right-col50[
* As size of matrix increases, approximation improves.
* Has a maximum value:

$$ \lambda\_{+} \equiv \sigma\left(\sqrt{N_g} + \sqrt{N_c}\right) $$
]
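
???
* A quick numerical check of the edge formula above, as a sketch on simulated data. The matrix size, `sigma` and the random seed are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells, sigma = 400, 300, 2.0

# Homoskedastic noise: every element has zero mean and variance sigma^2.
delta = sigma * rng.standard_normal((n_genes, n_cells))

# Singular values are returned in descending order.
singular_values = np.linalg.svd(delta, compute_uv=False)

# Predicted upper edge of the singular value spectrum.
lambda_plus = sigma * (np.sqrt(n_genes) + np.sqrt(n_cells))
ratio = singular_values[0] / lambda_plus
```

* For matrices of this size `ratio` should sit close to one, and the agreement tightens as the matrix grows.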

---

# Spike-in model

Consider a rank 1 perturbation of a purely ![:emph](Gaussian) random matrix
$$ n\_{i\alpha} = \gamma x\_{i} \bar{x}\_\alpha + \delta\_{i\alpha} $$
It has been shown<sup>1</sup> that:
.footnote[1.cite[D Féral The largest eigenvalue of rank one deformation of large Wigner matrices. (2006)]]

--
* If $\gamma \le \lambda\_{+}$, the top singular value of $n$ converges to $\lambda\_+$
--
* If $\gamma > \lambda\_{+}$, the top singular value of $n$ converges to $\gamma + \lambda\_+/\gamma$
* If $\gamma \ge \lambda\_{+}$, the overlap of the top eigenvector of $n$ with $x$ converges to $1 - (\lambda\_+ / \gamma)^2$

???
* Phase transition at $\gamma = \lambda\_{+}$
* Top principal component will now limit to the perturbation.

#### Suggests simple algorithm to detect $\mu_{i\alpha}$
* Compute SVD of $n\_{i\alpha}$
* Keep statistically significant components: $\lambda$ larger than $\lambda_+$
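
???
* The two steps above can be sketched on toy data as follows. This assumes the noise scale `sigma` is known; `significant_components` and the planted rank-one signal are hypothetical and only serve as an illustration:

```python
import numpy as np

def significant_components(n, sigma):
    """Keep SVD components whose singular value exceeds the noise edge."""
    n_genes, n_cells = n.shape
    lambda_plus = sigma * (np.sqrt(n_genes) + np.sqrt(n_cells))
    u, s, vt = np.linalg.svd(n, full_matrices=False)
    keep = s > lambda_plus
    # Rebuild the estimated mean from the retained components only.
    mu_hat = (u[:, keep] * s[keep]) @ vt[keep]
    return mu_hat, int(keep.sum())

rng = np.random.default_rng(1)
n_genes, n_cells, sigma = 300, 200, 1.0

# Unit-norm gene and cell vectors defining the planted rank-one signal.
x = rng.standard_normal(n_genes)
x /= np.linalg.norm(x)
x_bar = rng.standard_normal(n_cells)
x_bar /= np.linalg.norm(x_bar)

# Signal strength chosen well above the noise edge.
gamma = 4.0 * sigma * (np.sqrt(n_genes) + np.sqrt(n_cells))
n = gamma * np.outer(x, x_bar) + sigma * rng.standard_normal((n_genes, n_cells))

mu_hat, rank = significant_components(n, sigma)
```

* With the signal this far above the edge, the leading retained component should align closely with the planted vector `x`.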

---
count:false

# Spike-in model

#### Toy data

.left-col50[![:scale 550](/figs/devbio/seqspace/normalize/gaussian_svd.png)]

???
* Simple to verify the validity of the algorithm empirically.
* Worthwhile to walk through the plot as it will come up again.
* On the vertical axis is the component rank, descending order (1st being the largest principal value).
* On the horizontal axis is the singular value.
* Both are log scale.
* Yellow and the green dashed lines are the known spectra of the decomposition.

--
.right-col50[![:scale 550](/figs/devbio/seqspace/normalize/gaussian_overlap.png)]

???
* Great agreement between the predicted and known mean!
* However, this is limited to homoskedastic matrices.

---

# Generalize this idea to heteroskedastic matrices?
It has been shown<sup>1, 2</sup> that the distribution of $\lambda$ for a heteroskedastic matrix converges to ![:emph](Marchenko-Pastur) if
.footnote[1.cite[M. Idel et al. (2016)], 2.cite[B. Landa et al. Biwhitening Reveals the Rank of a Count Matrix (2021)]]
 * Average variance for each row is one
 * Average variance for each column is one

???
* Recent work has shown the eigenvalues of a (constrained) heteroskedastic matrix have similar properties.
* Specifically, a doubly stochastic matrix, i.e. every row and column sum is 1.
--

#### Suggests simple algorithm
* Introduce cell $c\_\alpha$ and gene $g\_i$ scale factors, $\tilde{n}\_{i\alpha} \equiv g\_i n\_{i\alpha} c_\alpha$
???
* Tildes denote a "rescaled" variant of each quantity.
--
* Obtain by solving
$$ \sum\_\alpha \langle\tilde{\delta}\_{i\alpha}^2\rangle = \sum\_\alpha g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_c \qquad \sum\_i\langle\tilde{\delta}\_{i\alpha}^2\rangle = \sum\_i g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_g $$
* Given model for $\langle \delta_{i\alpha}^2 \rangle$, above can be solved using ![:emph](Sinkhorn-Knopp).
???
* Sinkhorn-Knopp is an iterative algorithm that alternates between rescaling rows and columns to enforce the doubly stochastic constraint.
--
* Compute SVD of $\tilde{n}_{i\alpha}$.
* Keep components with $\lambda > \lambda\_+$
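
???
* A minimal sketch of the rescaling step, assuming a strictly positive variance matrix is already available. `sinkhorn_knopp_scale` is a hypothetical helper, and the toy variance matrix is random rather than derived from data:

```python
import numpy as np

def sinkhorn_knopp_scale(variance, n_iter=500, tol=1e-10):
    """Find gene factors g and cell factors c so that the rescaled variance
    g_i^2 * variance * c_alpha^2 has row sums N_c and column sums N_g."""
    n_genes, n_cells = variance.shape
    a = np.ones(n_genes)  # a_i = g_i^2
    b = np.ones(n_cells)  # b_alpha = c_alpha^2
    for _ in range(n_iter):
        a = n_cells / (variance @ b)    # enforce the row constraint
        b = n_genes / (variance.T @ a)  # enforce the column constraint
        # Columns are now exact; stop once the rows are satisfied as well.
        if np.abs(a * (variance @ b) - n_cells).max() < tol:
            break
    return np.sqrt(a), np.sqrt(b)

rng = np.random.default_rng(0)
variance = rng.gamma(2.0, size=(50, 40)) + 0.1  # toy heteroskedastic variances
g, c = sinkhorn_knopp_scale(variance)
scaled = g[:, None] ** 2 * variance * c[None, :] ** 2
```

* The rescaled counts would then be `g[:, None] * n * c[None, :]`, followed by the same SVD thresholding as in the homoskedastic case.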

---

# Modelling overdispersed count data
#### Negative binomial distribution for ![:emph](each gene)
Utilize a generalized linear model to account for sequencing depth $n_\alpha$ variation

$$ p(n\_{i\alpha}|\mu\_{i\alpha},\phi_i) = \frac{\Gamma(n+\phi)}{\Gamma(n+1)\Gamma(\phi)} \left(\frac{\mu}{\mu+\phi}\right)^n \left(\frac{\phi}{\mu+\phi}\right)^{\phi}$$
???
* We write a negative binomial parameterized in terms of mean $\mu$ and overdispersion factor $\phi$.
* Usually see this for discrete $\phi$ and probability of success $p$.

The mean of the distribution is given by
$$ \log\left(\mu\_{i\alpha}\right) \equiv A\_i + B\_i\log\left(n\_{\alpha}\right) $$
???
* $n\_\alpha$ is the sequencing depth for cell $\alpha$.
* $B\_i$ controls the power law of scaling. Strong prior to be 1
* $A\_i$ sets the scale for each gene.

Fit $A_i$, $B_i$, and $\phi_i$ per gene by maximum likelihood, given the observed counts per gene.

Take care to not overfit for lowly expressed genes!
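
???
* A sketch of the per-gene fit on simulated counts, using the negative binomial written above with $\phi$ fit on a log scale. The simulated depths, the true parameter values, the bounds and the choice of `L-BFGS-B` are all assumptions for illustration; no safeguard against overfitting lowly expressed genes is included:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def nb_negative_log_likelihood(params, counts, log_depth):
    """Negative log-likelihood for one gene; params = (A, B, log_phi)."""
    a, b, log_phi = params
    phi = np.exp(log_phi)
    mu = np.exp(a + b * log_depth)  # GLM mean: log(mu) = A + B log(depth)
    log_p = (gammaln(counts + phi) - gammaln(counts + 1.0) - gammaln(phi)
             + counts * np.log(mu / (mu + phi))
             + phi * np.log(phi / (mu + phi)))
    return -log_p.sum()

# Simulate one gene across cells whose depth spans two decades.
rng = np.random.default_rng(2)
log_depth = rng.uniform(np.log(1e3), np.log(1e5), size=2000)
a_true, b_true, phi_true = -6.0, 1.0, 2.0
mu_true = np.exp(a_true + b_true * log_depth)
counts = rng.negative_binomial(phi_true, phi_true / (phi_true + mu_true))

# Start from B = 1 (the prior expectation) and a moment-based guess for A.
x0 = np.array([np.log(counts.mean()) - log_depth.mean(), 1.0, 0.0])
bounds = [(-20.0, 5.0), (0.0, 3.0), (-5.0, 5.0)]
result = minimize(nb_negative_log_likelihood, x0, args=(counts, log_depth),
                  method="L-BFGS-B", bounds=bounds)
a_hat, b_hat, phi_hat = result.x[0], result.x[1], np.exp(result.x[2])
```

* In this parameterization the model variance is $\mu + \mu^2/\phi$, so small $\phi$ corresponds to strong overdispersion.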

---
count:false

# Modelling overdispersed count data
#### Negative binomial distribution fits data

.center[
![:scale 700](/figs/devbio/seqspace/normalize/model_fit.png)
]

---

# Modelling overdispersed count data
#### Putting it all together: Normalization schema
Unbiased estimator for the variance of negative binomial
$$ \langle \delta^2\_{i\alpha} \rangle = \frac{\mu\_{i\alpha} + \phi\_i \mu\_{i\alpha}^2}{1 + \phi\_i} $$
--
The initial estimate for the mean is given by our estimated GLM model
$$ \log\left(\mu\_{i\alpha}\right) = A\_i + B\_i\log\left(n\_{\alpha}\right) $$
--
Normalization factors estimated by (![:emph](Sinkhorn-Knopp))
$$ \sum\_\alpha g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_c \qquad \sum\_i g\_i^2 \langle\delta\_{i\alpha}^2 \rangle c\_\alpha^2 =N\_g $$

---

# Method accurately recapitulates toy data
.left-col50[
#### Naive SVD doesn't see low rank
![:scale 525](/figs/devbio/seqspace/normalize/negbinom_naive.png)
]

???
* No obvious knee. Maybe rank 1 or 2.
* Overdispersed variance causes a huge overestimation of rank by MP.

---
count:false

# Method accurately recapitulates toy data
.left-col50[
#### Rescaled SVD does
![:scale 525](/figs/devbio/seqspace/normalize/negbinom.png)
]
???
* Rescaling discovers the majority of the low rank mean.
* Imperfect: Components below the MP cutoff are not detected.

.left-col50[
#### Noisy reconstruction of mean
![:scale 525](/figs/devbio/seqspace/normalize/negbinom_mean.png)
]
???
* Mean can be inferred! Much noisier though.
* Higher variance due to missing components _and_ propagated error from uncertainty in $\phi$.

---

# Drosophila embryo expression fits well

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;

.left-col50[
![:scale 520](/figs/devbio/seqspace/normalize/estimated_rank.png)
]
???
* Discover 45 statistically significant components
* No sharp knee. Reasonable to assume we are missing degrees of freedom below the noise.

.right-col50[
![:scale 520](/figs/devbio/seqspace/participation_ratio.svg)
]
???
* Interestingly, these 45 components are _not_ localized on a few genes.
* I show the participation ratio here, a measure of localization.
* Can view it as roughly corresponding to the number of genes that actively contribute to each eigenvector.
* Each component involves $250-1000$ genes! Not $4$ gap genes.
* Not as fully delocalized as the noise though. Some functional group structure detected.
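* For reference, a sketch of one common definition of the participation ratio (the exact definition behind the figure is an assumption here); `participation_ratio` is a hypothetical helper:

```python
import numpy as np

def participation_ratio(v):
    """Effective number of entries contributing to the vector v.

    Returns 1 for a vector localized on a single entry and len(v) for a
    vector spread uniformly over all entries.
    """
    w = np.asarray(v, dtype=float) ** 2
    return w.sum() ** 2 / (w ** 2).sum()

localized = np.zeros(100)
localized[7] = 1.0
uniform = np.ones(100)

pr_localized = participation_ratio(localized)
pr_uniform = participation_ratio(uniform)
```

* Applied to each retained singular vector over genes, this yields a gene count per component that can be compared against the value obtained for noise vectors.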

---

.center[
.vertical-center[
# Can we leverage existing databases to embed scRNAseq into space?
]
]

---
# How to map normalized scRNAseq data to space?

.center[Berkeley Drosophila Transcription Network Project]
.center[
![:scale 800](/figs/devbio/drosophila/bdtnp.jpg)
]
.center[Aligned point clouds of $\sim 80$ genes]

???
* Over a decade ago, the Berkeley Drosophila Transcription Network Project set out to understand the complex network of transcriptional regulation required for development.
* Used Drosophila melanogaster as a model system to explore formation of expression patterns.
* Took coarse time-series cohort data.
* Confocal microscopy of fluorescence data generated a point cloud of a handful of genes at a time.
* Arduous task of aligning the various point clouds to a common reference embryo.
* Result is a point cloud of 3D expression patterns.

---
count: false

# How to map normalized scRNAseq data to space?

.center[
![:scale 1000](/figs/devbio/seqspace/data_overview_pointcloud.svg)
]

???
* Recall the shape of our scRNAseq data is a normalized table of genes by cells.
* The database gives us 3D point clouds of expression.
* Here I show just one "channel", eve.

---
count: false

# How to map normalized scRNAseq data to space?

.center[
![:scale 1000](/figs/devbio/seqspace/data_overview.svg)
]

???
* Can view this just as another matrix.
* Here positions, or virtual cells, are just different columns in the matrix.
* Don't model space explicitly. Map to the points within the point cloud.
* Conceptualize this as mapping one bag of cells to another, where one bag of cells happens to have spatial labels.

---

count:false
# How to map normalized scRNAseq data to space?

.center[
![:scale 1000](/figs/devbio/seqspace/data_mapping.svg)
]

???
* Formulate a probabilistic model, i.e. what's the probability that each cell in our scRNAseq dataset was sampled from each position in our BDTNP dataset?
* View this as finding a probability distribution for each column of our scRNAseq table over columns of our database table.

#### Regularized optimal transport
$$ E\left(\rho\_{i\alpha}\right) = \displaystyle\sum\limits\_{i\alpha} \rho\_{i\alpha} J\_{i\alpha} + T\displaystyle\sum\limits\_{i\alpha} \rho\_{i\alpha} \log\left(\rho\_{i\alpha}\right) + \text{marginal constraints}$$

???
* Best framed in the optimal transport language: what's the minimal "cost" to assign each scRNAseq cell to a position on the embryo.
* It is not reasonable to search for a bijection here! Instead want the probability over positions.
* Regularize it with an entropy term: at temperature $T$, sharply peaked (low-entropy) assignments are penalized.
* Solution is found again by Sinkhorn-Knopp, once we compute the cost matrix $J$.
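* A minimal sketch of the entropically regularized problem, assuming uniform marginals and a toy squared-distance cost. `entropic_transport`, the random profiles and the temperature value are hypothetical stand-ins, not the cost matrix used in this work:

```python
import numpy as np

def entropic_transport(cost, row_marginal, col_marginal, temperature, n_iter=1000):
    """Minimize sum(rho * cost) + T * sum(rho * log(rho)) under fixed marginals
    by alternating row and column rescaling (Sinkhorn-Knopp)."""
    # Shifting the cost by a constant leaves the optimal plan unchanged.
    kernel = np.exp(-(cost - cost.min()) / temperature)
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iter):
        u = row_marginal / (kernel @ v)
        v = col_marginal / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(3)
cells = rng.standard_normal((30, 5))      # toy scRNAseq profiles (cells x genes)
positions = rng.standard_normal((20, 5))  # toy database profiles (positions x genes)

# Squared distance between every cell and every position, scaled to mean one.
cost = ((cells[:, None, :] - positions[None, :, :]) ** 2).sum(axis=2)
cost /= cost.mean()

rho = entropic_transport(cost, np.full(30, 1 / 30), np.full(20, 1 / 20), temperature=0.5)

# Probability over positions for each cell.
per_cell = rho / rho.sum(axis=1, keepdims=True)
```

* Raising `temperature` spreads each cell over more positions; lowering it sharpens the map at the expense of sensitivity to noise in the cost.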

---

# Recapitulate the in-situ database

.left-col50[
![:scale 505](/figs/basel/dev/correlation_vs_temperature_of_fit_gmm_continuous.png)
]

???
* Linear correlation of predicted mean expression pattern of mapped scRNAseq data to BDTNP database as a function of inverse temperature.
* Non-monotonic: infinite temperature is the uniform solution, zero temperature is singular and susceptible to noise.
* Error bars are standard deviation over database genes.
* Capture $\sim70\%$ of the variation.

.right-col50[
![:scale 505](/figs/basel/morpho/optimal_transport_mapping_entropy.png)
]

???
* Higher temperature implies a less resolved map.
* Each cell is mapped to $\sim 30$ possible positions.
* Not as precise as the Princeton survey.

---

# Drosophila expression at "single-cell" resolution
.center[
![:scale 342](/figs/devbio/seqspace/disco.png)
![:scale 342](/figs/devbio/seqspace/Kr.png)
![:scale 342](/figs/devbio/seqspace/twi.png)
![:scale 342](/figs/devbio/seqspace/eve.png)
]
Shown above are (top) disco, Krüppel and (bottom) twist, eve.

---

# Collaboration to produce web interface

.middle[
.center[
<video width="700" height="525" frameborder="0" controls autoplay>
 <source src="/figs/devbio/drosophila/webpage.mp4" type="video/mp4">
</video>
]
]

---

.center[
.vertical-center[
# Can we infer space directly from the transcriptome?
]
]

---
# Density scales as if low-dimensional manifold

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/radius_scaling_pointcloud.svg)
]
]

---

count:false
# Density scales as if low-dimensional manifold

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/radius_scaling_ball.svg)
]
]

.right-col50[
.center[
![:scale 550](/figs/devbio/seqspace/euclidean_ball_scaling.svg)
]
]

---
count:false
# Density scales as if low-dimensional manifold

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/radius_scaling_neighborhood.svg)
]
]

---
count:false
# Density scales as if low-dimensional manifold

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/radius_scaling_all_neighborhood.svg)
]
]

---
count:false
# Density scales as if low-dimensional manifold

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/radius_scaling_shortest_path.svg)
]
]

.right-col50[
.center[
![:scale 575](/figs/devbio/seqspace/geodesic_ball_scaling.svg)
]
]

---
# Formulation of manifold inference

#### Regression for homeomorphism of manifold

.center[![:scale 900](/figs/devbio/seqspace/auto_encoder.svg)]

![:emph](Autoencoder): find an identity map that projects to low dimensions
$$ E(W, b) = \sum\_{i,a} \left(x\_{ia} - y\_{ia}\right)^2 = \sum\_{i,a} \left(x\_{ia} - \phi^{-1}\left(z\_{ia}\right) \right)^2 = \sum\_{i,a} \left(x\_{ia} - \phi^{-1}\left(\phi\left(x\_{ia}\right)\right) \right)^2 $$

---
# Difficult interpretation of vanilla autoencoder

.left-col50[
.center[
Sparse clusters for MNIST. Not generative
![:scale 500](/figs/devbio/seqspace/mnist_ae.png)
]
]

.left-col50[
.center[
Similar results for scRNAseq data
![:scale 550](/figs/devbio/seqspace/vanilla_ae_latent.png)
]
]

.center[Need a way to ![:emph](regularize) the learning process]

---
# Regularize by preserving topology of scRNAseq data

.left-col50[
.center[
![:scale 400](/figs/devbio/seqspace/radius_scaling_shortest_path.svg)
]
Topology hard to parameterize. Settle for distances
]

.right-col50[
#### Main idea:
* Estimate pairwise geodesic distances $D_{ij}$
* Impose latent space "isometry" $$|z\_{i} - z\_{j}| \sim D\_{ij}$$
]

.right-col50[
.center[
![:scale 400](/figs/devbio/seqspace/auto_encoder.svg)
]
$$ E(W, b, \Lambda) \sim \sum\_{i,a} \left(x\_{ia} - y\_{ia}\right)^2 + \Lambda \sum\_{i,j} \left(D\_{ij} - |\vec{z}\_i - \vec{z}\_j| \right)^2$$
]
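
???
* A sketch of how the regularized loss could be evaluated, assuming geodesics are approximated by shortest paths on a k-nearest-neighbor graph. `geodesic_distances`, `regularized_loss`, the value of `k` and the half-circle toy data are hypothetical; the encoder and decoder networks themselves are not shown:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def geodesic_distances(x, k=8):
    """Approximate geodesics by shortest paths on a k-nearest-neighbor graph."""
    d = cdist(x, x)
    graph = np.zeros_like(d)  # zero entries are treated as missing edges
    neighbors = np.argsort(d, axis=1)[:, 1:k + 1]  # skip column 0 (self)
    rows = np.repeat(np.arange(len(x)), k)
    graph[rows, neighbors.ravel()] = d[rows, neighbors.ravel()]
    return shortest_path(graph, method="D", directed=False)

def regularized_loss(x, y, z, geodesic, weight):
    """Reconstruction error plus a penalty on latent distance mismatch.

    x: data (cells x features), y: reconstruction, z: latent coordinates.
    """
    reconstruction = ((x - y) ** 2).sum()
    isometry = ((geodesic - cdist(z, z)) ** 2).sum()
    return reconstruction + weight * isometry

# Toy manifold: a half circle embedded in two dimensions.
t = np.linspace(0.0, np.pi, 100)
x = np.column_stack([np.cos(t), np.sin(t)])
geodesic = geodesic_distances(x)

# Two candidate 1D latent spaces, both with perfect reconstruction (y = x).
loss_arc = regularized_loss(x, x, t[:, None], geodesic, weight=1.0)
loss_projection = regularized_loss(x, x, x[:, :1], geodesic, weight=1.0)
```

* The arc-length coordinate yields the lower loss, since it preserves distances along the curve, whereas the projection onto one axis compresses the ends.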

---
# Learns canonical manifolds

.left-col50[
.center[
Swiss roll
![:scale 550](/figs/devbio/seqspace/swiss_roll.png)
]
]

.right-col50[
.center[
Autoencoder latent space
![:scale 550](/figs/devbio/seqspace/swiss_roll_learned.png)
]
]

---
# Learned manifold recapitulates scRNAseq data

Use the described loss function and architecture on Drosophila scRNAseq

.left-col50[![:scale 500](/figs/devbio/seqspace/scrna_loss.svg)]
.right-col50[![:scale 500](/figs/devbio/seqspace/scrna_genes_vs_reconstructed.png)]

.center[ Capture $\sim 80\%$ of the variances in the normalized scRNA counts ]

---
# Learned manifold recapitulates space

#### Estimated positions

.left-col50[
.center[
![:scale 500](/figs/devbio/embedding/AP_bdtnp.png)
]
]

.right-col50[
.center[
![:scale 500](/figs/devbio/embedding/DV_bdtnp.png)
]
]

--
AP and DV positions fall along diagonals of 2D square.

--
Investigation into the third dimension ongoing.

---
# UMAP does not

.left-col50[
.center[
AP axis
![:scale 500](/figs/devbio/seqspace/umap_drosophila_ap.png)
]
]

.left-col50[
.center[
DV axis
![:scale 500](/figs/devbio/seqspace/umap_drosophila_dv.png)
]
]

---

.center[
.vertical-center[
# Conceptual Outlook
]
]

---
count:false
# Many-body formulation of positional information

.left-col50[
.center[
![:scale 450](/figs/devbio/seqspace/spatial_vs_expression_geodesic.svg)
]
]

---
count:false

# Many-body formulation of positional information

.left-col50[
.center[
Isometric embedding
![:scale 350](/figs/devbio/seqspace/butterfly_bdntp.png)
]
]

.right-col50[
.center[
![:scale 560](/figs/devbio/seqspace/expression_to_space_jacobian_cartoon.svg)
]
]

--
.left[
#### Takeaways
]

Previous attempt<sup>1</sup> assumed isometry between space and expression
.footnote[
1.cite[Nitzan M. et al. Gene expression cartography. Nature (2019)]
]

--
Jacobian ![:emph](captures) positional information of whole ![:emph](transcriptome)