presentation

Nicholas Noll
Kavli Institute for Theoretical Physics, Santa Barbara, USA
]
]

---

# Quick recap of the overall question

.left-col50[
.center[
L. Wolpert. Positional Information and Pattern Formation in Development
![:scale 500](/figs/devbio/seqspace/wolpert_positional_information.png)
]
]

.right-col50[
.center[
![:scale 500](/figs/devbio/seqspace/bialek_positional_info_header.png)
![:scale 450](/figs/devbio/seqspace/bialek_positional_info_fig.png)
]
* Positional information is constant 
* Determined by gap genes
]

---

# Reinterpretation as a many-body encoding?

scRNAseq technology allows high-throughput analyses

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;

---

# Quick recap of the data

---

count:false
# Quick recap of the data

---

count:false
# Can estimate the pattern of any gene
.center[
![:scale 350](/figs/devbio/seqspace/disco.png)
![:scale 350](/figs/devbio/seqspace/Kr.png)
![:scale 350](/figs/devbio/seqspace/twi.png)
![:scale 350](/figs/devbio/seqspace/eve.png)
]
Shown above: (top) disco, Krüppel; (bottom) twist, eve.

---

# Can we discover the spatial mapping using just the cell expression?

---

# Amenable to linear dimensional reduction?

---

count:false
# Amenable to linear dimensional reduction?

.right-col50[
.center[
![:scale 500](/figs/devbio/seqspace/participation_ratio_subsample_scaling.svg)
]
]

* $\sim 25$ to $50$ relevant linear directions
* Nonlocal: each direction involves $\sim 10^2$ to $10^3$ genes
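
A minimal sketch of how such a linear-dimension estimate could be computed, assuming the statistic in the figure is the participation ratio of the covariance spectrum, $(\sum\_a \lambda\_a)^2 / \sum\_a \lambda\_a^2$ (suggested by the figure's file name, but the exact estimator is an assumption). The toy data and all names are hypothetical.

```python
import numpy as np

def participation_ratio(X):
    """Participation ratio of the covariance spectrum of X (cells x genes)."""
    Xc = X - X.mean(axis=0)                    # center each gene
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values of centered data
    lam = s**2                                 # proportional to covariance eigenvalues
    return lam.sum()**2 / (lam**2).sum()       # scale-invariant ratio

rng = np.random.default_rng(0)
# Hypothetical toy data: 500 "cells", 200 "genes", 5 latent directions of equal variance.
latent = rng.normal(size=(500, 5))
mixing = np.linalg.qr(rng.normal(size=(200, 5)))[0]   # orthonormal columns
X = latent @ mixing.T + 0.01 * rng.normal(size=(500, 200))

pr = participation_ratio(X)
print(f"participation ratio = {pr:.2f}")       # expected to be close to 5
```

The ratio equals $k$ when variance is spread evenly over $k$ directions and drops toward 1 when a single direction dominates.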

---

# Simple scaling analysis in this space

---

count:false
# Simple scaling analysis in this space

---

count:false
# Simple scaling analysis in this space

---

count:false
# Simple scaling analysis in this space

---

count:false
# Simple scaling analysis in this space

---

# How to find representation of low-dimensional manifold?

* Problem: Given the squared pairwise distances $D^2_{ij}$ between points, how do we estimate a low-dimensional embedding?
* Solution: Formulate as an optimization problem.

Classical multidimensional scaling minimizes the energy with respect to $z$:

$$ E = \frac { \sum\_{i,j}(B\_{ij} - \sum\_a z\_{ai} z\_{aj})^2} {\sum\_{i,j} B\_{ij}^2} $$

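where $B$ is the double-centered matrix of squared distances ($\mathbf{1}$ is the vector of ones):

$$ B \equiv -\frac 1 2 \left[ I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \right] D^2 \left[ I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \right] $$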

---

Why the centered distance matrix?
* Goal is to find a low-dimensional embedding where Euclidean pairwise distances reproduce geodesic distances.
--
* Not unique: if $\hat{z}$ is a solution then $\hat{z} + c$ is also a solution
--

Steps to see:
* Choose the centered configuration $\sum\_i z_{ai} = 0 \, \forall a$
--
* $b\_{ij} \equiv \sum\_{a} z\_{ai} z\_{aj} \implies d^2\_{ij} = b\_{ii} + b\_{jj} - 2b\_{ij}$
--
* $\sum\_{i}b\_{ij} = \sum\_{i,a} z\_{ai} z\_{aj} = \sum\_{a} z\_{aj} \left(\sum\_{i} z\_{ai}\right) = 0 $
--
* $\sum\_{i}d^2\_{ij} = \sum\_i b\_{ii} + n b\_{jj} = \mathrm{Tr}[b] + n b\_{jj} $
--

Taken together, these imply $B = -\frac 1 2 \left[ I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \right] D^2 \left[ I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \right] $
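
The steps above can be checked numerically. This is a small sketch on a random centered configuration (sizes and seed are arbitrary choices, not from the talk): build $b$ and $d^2$ from $z$, double-center $d^2$, and confirm that the Gram matrix is recovered.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 3
z = rng.normal(size=(d, n))              # z[a, i]: coordinate a of point i
z -= z.mean(axis=1, keepdims=True)       # centered configuration: sum_i z_ai = 0

b = z.T @ z                              # Gram matrix b_ij = sum_a z_ai z_aj
diag = np.diag(b)
D2 = diag[:, None] + diag[None, :] - 2 * b   # d^2_ij = b_ii + b_jj - 2 b_ij

J = np.eye(n) - np.ones((n, n)) / n      # centering matrix I - (1/n) 1 1^T
B = -0.5 * J @ D2 @ J                    # double-centered squared distances

print(np.allclose(B, b))                 # True
```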

---

Classical multidimensional scaling minimizes the energy with respect to $z$:

$$ E = \frac { \sum\_{i,j}(B\_{ij} - \sum\_a z\_{ai} z\_{aj})^2} {\sum\_{i,j} B\_{ij}^2} $$

As written, this is just an eigenvalue decomposition problem. 
Diagonalize $B$ and keep only the eigenvectors with the largest eigenvalues, each scaled by the square root of its eigenvalue.

Isomap algorithm. Given a set of points $x\_{gi}$:
1. Estimate geodesic distances $D^2$ (shortest paths on a nearest-neighbor graph)
2. Compute centered distance matrix $B$
3. Perform classical MDS
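
The three steps above can be sketched in plain NumPy. This is an illustrative toy version, not the code used for the results in this talk: the function names, the $k$-nearest-neighbor graph, and the Floyd-Warshall shortest-path step are all assumptions, and the neighbor graph is assumed to be connected.

```python
import numpy as np

def classical_mds(D2, dim):
    """Embed points from squared distances D2 by diagonalizing the centered matrix B."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    w, v = np.linalg.eigh(B)                       # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]                # keep the largest `dim`
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))   # rows are points

def isomap(X, k, dim):
    """X has one point per row. Assumes the k-nearest-neighbor graph is connected."""
    n = X.shape[0]
    sq = (X**2).sum(axis=1)
    d = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    np.fill_diagonal(d, 0.0)
    # Step 1: geodesics as shortest paths on a symmetrized k-nearest-neighbor graph.
    G = np.full((n, n), np.inf)
    nbrs = np.argsort(d, axis=1)[:, 1:k + 1]       # column 0 is the point itself
    rows = np.repeat(np.arange(n), k)
    G[rows, nbrs.ravel()] = d[rows, nbrs.ravel()]
    G = np.minimum(G, G.T)
    np.fill_diagonal(G, 0.0)
    for m in range(n):                             # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    # Steps 2 and 3: center the squared geodesics and run classical MDS.
    return classical_mds(G**2, dim)

# Toy check: an arc embedded in 10 dimensions should unroll to its arc length.
t = np.linspace(0, 1.5 * np.pi, 100)
X = np.zeros((100, 10))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
Z = isomap(X, k=4, dim=1)
print(abs(np.corrcoef(Z[:, 0], t)[0, 1]))          # close to 1
```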

---

# Discovers meaningful latent space

.left-col50[
.center[
![:scale 525](/figs/devbio/seqspace/distance_correlation_vs_isomap_dimension.svg)
]
]

.right-col50[
.center[
![:scale 525](/figs/devbio/seqspace/distance_isomap_correlation_eg_d3.svg)
]
]

While we can improve beyond 3 dimensions, 3 appears to be the knee of diminishing returns.

---

Compare embedding with average predicted position on the embryo.

Strong spatial signal discovered from just expression pairwise distances!

---

.right-col50[
.center[
![:scale 550](/figs/devbio/seqspace/expression_to_space_jacobian_cartoon.svg)
]
]

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;

#### Important takeaway

* Expression is not isometric to embryo space.
* Counter to Nitzan M. et al., Gene expression cartography, Nature 2019.

---

class: middle, center
# Problem: solution is not generalizable
We have no interpolation.

How to deal with new data points?

How to compute the Jacobian, i.e. positional information?

---

class: middle, center
# Reformulate question: 
How to find a ![:emph](representation) of the low-dimensional manifold $\rightarrow$ how to find ![:emph](a map) to the manifold.

---

# Quick auto-encoder tutorial
Falls under the genre of "unsupervised learning".
Want a homeomorphism of the manifold.

Feedforward architecture ($f$ is non-linear function):
$$ x^{\ell}\_{a} = f(W^{\ell}\_{ab} x^{\ell-1}\_{b} + b^\ell\_a)$$
Unclear what objective function to write down to obtain the weights.
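
A minimal sketch of the layer recursion above, assuming $f = \tanh$ and hypothetical layer widths (50 inputs, 16 hidden units, 3 outputs). The weights here are random, since nothing has been trained yet.

```python
import numpy as np

def forward(x, weights, biases, f=np.tanh):
    """Apply x^l = f(W^l x^(l-1) + b^l) layer by layer; x is a 1D feature vector."""
    for W, b in zip(weights, biases):
        x = f(W @ x + b)
    return x

rng = np.random.default_rng(2)
sizes = [50, 16, 3]                                # hypothetical layer widths
weights = [rng.normal(scale=1 / np.sqrt(m), size=(n, m))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

z = forward(rng.normal(size=50), weights, biases)
print(z.shape)                                     # (3,)
```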

---

count: false
# Quick auto-encoder tutorial
Falls under the genre of "unsupervised learning".
Want a homeomorphism of the manifold.

Duplicate and reflect the network (encoder followed by decoder). Train the output $y$ to reproduce the input $x$, i.e. learn the identity map:
$$ E(W, b) = \sum\_{i,a} (x\_{ia} - y\_{ia})^2$$
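
A toy sketch of minimizing this reconstruction energy, assuming a single $\tanh$ encoder layer, a linear decoder, and plain gradient descent with hand-written gradients. The data, layer sizes, and learning rate are hypothetical, and $E$ is averaged over samples for a stable step size.

```python
import numpy as np

rng = np.random.default_rng(3)
n, D, k = 200, 10, 2                       # hypothetical: samples, input dim, latent dim

# Toy data: a unit circle placed in a random 2D plane of R^D.
t = rng.uniform(0, 2 * np.pi, n)
Q, _ = np.linalg.qr(rng.normal(size=(D, 2)))
X = np.stack([np.cos(t), np.sin(t)], axis=1) @ Q.T

# Encoder (tanh) and mirrored decoder (linear), small random initial weights.
W1, b1 = 0.1 * rng.normal(size=(D, k)), np.zeros(k)
W2, b2 = 0.1 * rng.normal(size=(k, D)), np.zeros(D)

lr, losses = 0.05, []
for step in range(2000):
    Z = np.tanh(X @ W1 + b1)               # latent code z
    Y = Z @ W2 + b2                        # reconstruction y
    R = Y - X
    losses.append((R**2).sum() / n)        # E divided by the number of samples
    dY = 2 * R / n                         # dE/dY
    dZ = dY @ W2.T                         # backpropagate through the decoder
    dA = dZ * (1 - Z**2)                   # tanh'(a) = 1 - tanh(a)^2
    W2 -= lr * (Z.T @ dY)
    b2 -= lr * dY.sum(axis=0)
    W1 -= lr * (X.T @ dA)
    b1 -= lr * dA.sum(axis=0)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```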

---

# Intuition on simpler data set
Can we discover pullback/pushforward for the Drosophila gut?

To make it harder, embed in 50 dimensions, add noise.
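
A sketch of the "embed and add noise" step, using a plain cylinder as a stand-in surface (this is not the actual gut geometry). The sample size, noise level, and random embedding are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, D, sigma = 1000, 50, 0.05               # hypothetical: samples, ambient dim, noise level

# Stand-in surface: a cylinder of radius 1 and length 4.
theta = rng.uniform(0, 2 * np.pi, n)
s = rng.uniform(0, 4, n)
surface = np.stack([np.cos(theta), np.sin(theta), s], axis=1)   # shape (n, 3)

# Random isometric embedding R^3 -> R^D (orthonormal columns), then isotropic noise.
Q, _ = np.linalg.qr(rng.normal(size=(D, 3)))
X = surface @ Q.T + sigma * rng.normal(size=(n, D))
print(X.shape)                             # (1000, 50)
```

Because the columns of `Q` are orthonormal, distances on the surface are preserved before the noise is added, so any distortion in the learned map comes from the noise or the network.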

---

# Obvious problem: overfitting

Train on subset of data. Validate with the remainder.

.left-col50[![:scale 500](/figs/devbio/seqspace/gut_pointcloud.svg)]
.right-col50[![:scale 500](/figs/devbio/seqspace/validation_overfitting_eg_gut.svg)]

.right-col50[
#### Regularize by constraining map?
* Weight directions by relevance
* Constrain latent space $z$
]

---

# Regularize by constraining 'learned' homeomorphism

Use principal components as the input basis. Weight each component $a$ by its principal value $\lambda\_a$.

$$ E(W, b) = \sum\_{i,a} \lambda\_{a} (x\_{ia} - y\_{ia})^2$$

.left-col50[![:scale 500](/figs/devbio/seqspace/gut_pointcloud.svg)]
.right-col50[![:scale 500](/figs/devbio/seqspace/validation_weight_singular_value_eg_gut.svg)]
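
A sketch of the weighted energy, assuming the weights $\lambda\_a$ are the singular values of the centered data normalized to sum to one (the normalization is an assumption). The function names and toy data are hypothetical.

```python
import numpy as np

def to_pc_basis(X):
    """Return principal-component coordinates x_ia and normalized weights lambda_a."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt.T, s / s.sum()          # singular values come sorted, largest first

def weighted_loss(x, y, lam):
    """E = sum_{i,a} lambda_a (x_ia - y_ia)^2."""
    return float((lam * (x - y)**2).sum())

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
x, lam = to_pc_basis(X)

# The same unit error costs more along the leading component than the trailing one.
err_lead = np.zeros_like(x)
err_lead[:, 0] = 1.0
err_trail = np.zeros_like(x)
err_trail[:, -1] = 1.0
loss_lead = weighted_loss(x, x + err_lead, lam)
loss_trail = weighted_loss(x, x + err_trail, lam)
print(loss_lead > loss_trail)              # True
```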

---

Use principal components as the input basis. Weight each component $a$ by its principal value $\lambda\_a$.

$$ E(W, b) = \sum\_{i,a} \lambda\_{a} (x\_{ia} - y\_{ia})^2$$

.left-col50[.center[$z$ ![:scale 500](/figs/devbio/seqspace/latent_weight_singular_value_eg_gut.gif)]]
.right-col50[.center[$y$ ![:scale 500](/figs/devbio/seqspace/reconstructed_weight_singular_value_eg_gut.gif)]]

---

Constrain neighborhood distances in latent space to reproduce neighborhood distances from input.

$$ E(W, b) = \sum\_{i,a} \lambda\_{a} (x\_{ia} - y\_{ia})^2 + \sum\_{i} \sum\_{j \in N\_i} (D^{(x)}\_{ij} - D^{(z)}\_{ij})^2 $$
.left-col50[![:scale 500](/figs/devbio/seqspace/gut_pointcloud.svg)]
.right-col50[![:scale 500](/figs/devbio/seqspace/validation_neighborhood_isometry_eg_gut.svg)]
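
A sketch of the neighborhood term in the energy above, assuming $N\_i$ is the set of $k$ nearest neighbors of point $i$ in input space and that $D^{(x)}$, $D^{(z)}$ are Euclidean distances (the slide does not specify either, so both are assumptions). Names are hypothetical.

```python
import numpy as np

def pairwise_dist(A):
    """Euclidean distance matrix for points stored as rows of A."""
    sq = (A**2).sum(axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0.0))

def neighborhood_penalty(X, Z, k):
    """sum_i sum_{j in N_i} (D^x_ij - D^z_ij)^2 over k nearest input-space neighbors."""
    Dx, Dz = pairwise_dist(X), pairwise_dist(Z)
    np.fill_diagonal(Dx, np.inf)                   # a point is not its own neighbor
    nbrs = np.argsort(Dx, axis=1)[:, :k]           # N_i for every i
    rows = np.arange(X.shape[0])[:, None]
    return float(((Dx[rows, nbrs] - Dz[rows, nbrs])**2).sum())

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))       # random rotation or reflection

p_iso = neighborhood_penalty(X, X @ Q, k=5)        # isometric latent space
p_scaled = neighborhood_penalty(X, 2.0 * X @ Q, k=5)   # distances doubled
print(p_iso < 1e-8, p_scaled > 1.0)                # True True
```

Only local distances enter the penalty, so the latent space is free to unroll or flatten the manifold globally while staying locally isometric.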

---

Constrain neighborhood distances in latent space to reproduce neighborhood distances from input.

$$ E(W, b) = \sum\_{i,a} \lambda\_{a} (x\_{ia} - y\_{ia})^2 + \sum\_{i} \sum\_{j \in N\_i} (D^{(x)}\_{ij} - D^{(z)}\_{ij})^2 $$
.left-col50[![:scale 500](/figs/devbio/seqspace/latent_weight_neighborhood_isometry_eg_gut.gif)]
.right-col50[![:scale 500](/figs/devbio/seqspace/reconstructed_neighborhood_isometry_eg_gut.gif)]

---

![:emph](Important:) Depth of network controls the ability to "close" the manifold

.left-col50[![:scale 500](/figs/devbio/seqspace/reconstructed_deep_eg_gut.gif)]
.right-col50[![:scale 500](/figs/devbio/seqspace/gut_seam_close_up.png)]

---

# Learning the map for expression space

Use the same loss function and a similar architecture to learn the scRNAseq manifold.

.left-col50[![:scale 500](/figs/devbio/seqspace/scrna_loss.svg)]
.right-col50[![:scale 500](/figs/devbio/seqspace/scrna_genes_vs_reconstructed.png)]

#### Preliminary takeaways (still ongoing):
* Map data $10^4 \rightarrow 3$ coordinates per sequenced cell.
* Capture $83\%$ of variance

---

# Latent space representation for expression space

Colored by estimated AP/DV position (go to external pages)

---

# Next steps

* Run on variations of the underlying architecture to make sure we aren't too dependent on it.
* Extend to other developing systems