.center[
.vertical-center[
# SeqSpace: Geometry of expression in early Drosophila embryogenesis

Nicholas Noll

Kavli Institute for Theoretical Physics, Santa Barbara, USA
]
]

---

# Quick recap of the overall question

.left-col50[
.center[
L. Wolpert. Positional Information and Pattern Formation in Development

![:scale 500](/figs/devbio/seqspace/wolpert_positional_information.png)
]
]

.right-col50[
.center[
![:scale 500](/figs/devbio/seqspace/bialek_positional_info_header.png)
![:scale 450](/figs/devbio/seqspace/bialek_positional_info_fig.png)
]

* Positional information is constant
* Determined by gap genes
]

---

# Reinterpretation as a many-body encoding?

scRNAseq technology allows high-throughput analyses

.center[
![:scale 650](/figs/devbio/seqspace/drosophila_single_cell_resolution_header.png)
]

---

# Quick recap of the data

.center[
![:scale 1000](/figs/devbio/seqspace/data_overview.svg)
]

---
count:false

# Quick recap of the data

.center[
![:scale 1000](/figs/devbio/seqspace/data_mapping.svg)
]

---
count:false

# Can estimate the pattern of any gene

.center[
![:scale 350](/figs/devbio/seqspace/disco.png)
![:scale 350](/figs/devbio/seqspace/Kr.png)
![:scale 350](/figs/devbio/seqspace/twi.png)
![:scale 350](/figs/devbio/seqspace/eve.png)
]

Shown above: (top) disco, Krüppel; (bottom) twist, eve.

---
class: center, middle

# Can we discover the spatial mapping using just the cell expression?

---

# Amenable to linear dimensional reduction?

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/principal_values.svg)
]
]

--

.right-col50[
.center[
![:scale 500](/figs/devbio/seqspace/participation_ratio.svg)
]
]

---
count:false

# Amenable to linear dimensional reduction?
.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/principal_values.svg)
]
]

.right-col50[
.center[
![:scale 500](/figs/devbio/seqspace/participation_ratio_subsample_scaling.svg)
]
]

* $ \sim 25-50 $ relevant linear hyperplanes
* Nonlocal: each hyperplane involves $\sim 10^2 - 10^3$ genes

---

# Simple scaling analysis in this space

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/radius_scaling_pointcloud.svg)
]
]

---
count:false

# Simple scaling analysis in this space

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/radius_scaling_ball.svg)
]
]

--

.right-col50[
.center[
![:scale 550](/figs/devbio/seqspace/euclidean_ball_scaling.svg)
]
]

---
count:false

# Simple scaling analysis in this space

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/radius_scaling_neighborhood.svg)
]
]

---
count:false

# Simple scaling analysis in this space

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/radius_scaling_all_neighborhood.svg)
]
]

---
count:false

# Simple scaling analysis in this space

.left-col50[
.center[
![:scale 500](/figs/devbio/seqspace/radius_scaling_shortest_path.svg)
]
]

--

.right-col50[
.center[
![:scale 575](/figs/devbio/seqspace/geodesic_ball_scaling.svg)
]
]

---

# How to find representation of low-dimensional manifold?

* Problem: Given squared pairwise distances $D^2_{ij}$ between points, how do we estimate a low-dimensional embedding?
* Solution: Formulate as an optimization problem.

--

Classical multidimensional scaling minimizes the energy with respect to $z$:

$$ E = \frac { \sum\_{i,j}(B\_{ij} - \sum\_a z\_{ai} z\_{aj})^2} {\sum\_{i,j} B\_{ij}^2} $$

where $B$ is the double-centered squared-distance matrix ($\mathbf{1}$ is the all-ones vector of length $n$)

$$ B \equiv -\frac 1 2 \left[ I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right] D^2 \left[ I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right] $$

---
count: false

# How to find representation of low-dimensional manifold?

Why the centered distance matrix?

* Goal is to find a low-dimensional embedding where Euclidean pairwise distances reproduce geodesics.
--

* Not unique: if $\hat{z}$ is a solution, then $\hat{z} + c$ is also a solution

--

Steps to see this:

* Choose the centered configuration $\sum\_i z_{ai} = 0 \, \forall a$

--

* $b\_{ij} \equiv \sum\_{a} z\_{ai} z\_{aj} \implies d^2\_{ij} = b\_{ii} + b\_{jj} - 2b\_{ij}$

--

* $\sum\_{i}b\_{ij} = \sum\_{i,a} z\_{ai} z\_{aj} = \sum\_{a} z\_{aj} \sum\_{i} z\_{ai} = 0 $

--

* $\sum\_{i}d^2\_{ij} = \sum\_i b\_{ii} + n b\_{jj} = Tr[b] + n b\_{jj} $

--

Together these imply $B = -\frac{1}{2}\left[ I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right] D^2\left[ I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right] $, with $\mathbf{1}$ the all-ones vector.

---
count: false

# How to find representation of low-dimensional manifold?

Classical multidimensional scaling minimizes the energy with respect to $z$:

$$ E = \frac { \sum\_{i,j}(B\_{ij} - \sum\_a z\_{ai} z\_{aj})^2} {\sum\_{i,j} B\_{ij}^2} $$

--

As written, this is just an eigenvalue decomposition problem. Diagonalize $B$ and keep only the desired number of leading eigenvectors.

--

Isomap algorithm. Given a set of points $x\_{gi}$:

1. Estimate geodesics $D^2$
2. Compute centered distance matrix $B$
3. Perform classical MDS

---

# Discovers meaningful latent space

.left-col50[
.center[
![:scale 525](/figs/devbio/seqspace/distance_correlation_vs_isomap_dimension.svg)
]
]

.right-col50[
.center[
![:scale 525](/figs/devbio/seqspace/distance_isomap_correlation_eg_d3.svg)
]
]

While we can improve beyond 3 dimensions, 3 appears to be the knee of diminishing returns.

---
count: false

# Discovers meaningful latent space

Compare embedding with average predicted position on the embryo.

.left-col50[
.center[
![:scale 525](/figs/devbio/seqspace/isomap_3d_AP.svg)
]
]

.right-col50[
.center[
![:scale 525](/figs/devbio/seqspace/isomap_3d_DV.svg)
]
]

Strong spatial signal discovered from just expression pairwise distances!
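
---

# Sketch: Isomap in NumPy

The three Isomap steps listed earlier can be sketched in plain NumPy. This is a minimal illustration, not the pipeline behind the figures in this talk: the function names, the $k$-nearest-neighbor graph, and the default `k=8` are assumptions made for the example, and the Floyd-Warshall loop is only practical for small point clouds.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distances between the rows of x, shape (n, features)."""
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d = np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off
    np.fill_diagonal(d, 0.0)
    return d


def geodesic_distances(x, k=8):
    """Step 1: shortest-path distances on a symmetrized k-nearest-neighbor graph.

    Entries stay infinite if the graph is disconnected; increase k in that case.
    """
    n = x.shape[0]
    d = pairwise_distances(x)
    masked = d.copy()
    np.fill_diagonal(masked, np.inf)              # exclude self-matches
    nearest = np.argsort(masked, axis=1)[:, :k]   # (n, k) neighbor indices
    rows = np.repeat(np.arange(n), k)
    cols = nearest.ravel()
    graph = np.full((n, n), np.inf)
    graph[rows, cols] = d[rows, cols]
    graph = np.minimum(graph, graph.T)            # make edges undirected
    np.fill_diagonal(graph, 0.0)
    for m in range(n):                            # Floyd-Warshall, O(n^3)
        graph = np.minimum(graph, graph[:, m:m + 1] + graph[m:m + 1, :])
    return graph


def classical_mds(d, dim):
    """Steps 2 and 3: double-center D^2, then keep the top `dim` eigenpairs."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d**2) @ centering
    evals, evecs = np.linalg.eigh(b)              # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:dim]
    scale = np.sqrt(np.maximum(evals[top], 0.0))  # negative eigenvalues are set to zero
    return (evecs[:, top] * scale).T              # z[a, i], shape (dim, n)


def isomap(x, dim=3, k=8):
    """Embed the rows of x into `dim` dimensions."""
    return classical_mds(geodesic_distances(x, k), dim)
```

`classical_mds` returns an array of shape `(dim, n)`, matching the index order of $z\_{ai}$ in the energy $E$. For points sampled along a half circle, `geodesic_distances` should recover roughly the arc length between the endpoints rather than the chord.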
---
count: false

# Discovers meaningful latent space

.left-col50[
.center[
![:scale 525](/figs/devbio/seqspace/spatial_vs_expression_geodesic.svg)
]
]

--

.right-col50[
.center[
![:scale 550](/figs/devbio/seqspace/expression_to_space_jacobian_cartoon.svg)
]
]

#### Important takeaway

* Expression is not isometric to embryo space.
* Counter to Nitzan M. et al., Gene expression cartography, Nature 2019.

---
class: middle, center

# Problem: solution is not generalizable

We have no interpolation.

How to deal with new data points?

How to compute the Jacobian, i.e. positional information?

---
class: middle, center

# Reformulate question:

How to find ![:emph](representation) of low-dimensional manifold $\rightarrow$ how to find ![:emph](a map) to manifold.

---

# Quick auto-encoder tutorial

Falls under the genre of "unsupervised learning". Want homeomorphism of manifold

.center[![:scale 900](/figs/devbio/seqspace/encoder.svg)]

--

Feedforward architecture ($f$ is a nonlinear function):

$$ x^{\ell}\_{a} = f(W^{\ell}\_{ab} x^{\ell-1}\_{b} + b^\ell\_a)$$

Unclear what objective function to write down to obtain the weights.

---
count: false

# Quick auto-encoder tutorial

Falls under the genre of "unsupervised learning". Want homeomorphism of manifold

.center[![:scale 900](/figs/devbio/seqspace/auto_encoder.svg)]

Duplicate and reflect the network. Try to find an identity map

$$ E(W, b) = \sum\_{i,a} (x\_{ia} - y\_{ia})^2$$

---

# Intuition on simpler data set

Can we discover pullback/pushforward for the Drosophila gut?

.center[![:scale 600](/figs/devbio/seqspace/gut_pointcloud.svg)]

To make it harder, embed in 50 dimensions and add noise.

---

# Obvious problem: overfitting

Train on a subset of the data. Validate with the remainder.
.left-col50[![:scale 500](/figs/devbio/seqspace/gut_pointcloud.svg)]
.right-col50[![:scale 500](/figs/devbio/seqspace/validation_overfitting_eg_gut.svg)]

--

.left-col50[
#### Canonical computer science solutions

* Dropout
* Batch normalization
]

.right-col50[
#### Regularize by constraining map?

* Weight directions by relevance
* Constrain latent space $z$
]

---

# Regularize by constraining 'learned' homeomorphism

Use principal components as the input basis. Weight each component by its principal value

$$ E(W, b) = \sum\_{i,a} \lambda\_{a} (x\_{ia} - y\_{ia})^2$$

.left-col50[![:scale 500](/figs/devbio/seqspace/gut_pointcloud.svg)]
.right-col50[![:scale 500](/figs/devbio/seqspace/validation_weight_singular_value_eg_gut.svg)]

---
count: false

# Regularize by constraining 'learned' homeomorphism

Use principal components as the input basis. Weight each component by its principal value

$$ E(W, b) = \sum\_{i,a} \lambda\_{a} (x\_{ia} - y\_{ia})^2$$

.left-col50[.center[$z$ ![:scale 500](/figs/devbio/seqspace/latent_weight_singular_value_eg_gut.gif)]]
.right-col50[.center[$y$ ![:scale 500](/figs/devbio/seqspace/reconstructed_weight_singular_value_eg_gut.gif)]]

---
count: false

# Regularize by constraining 'learned' homeomorphism

Constrain neighborhood distances in latent space to reproduce neighborhood distances from input.

$$ E(W, b) = \sum\_{i,a} \lambda\_{a} (x\_{ia} - y\_{ia})^2 + \sum\_{i} \sum\_{j \in N\_i} (D^{(x)}\_{ij} - D^{(z)}\_{ij})^2 $$

.left-col50[![:scale 500](/figs/devbio/seqspace/gut_pointcloud.svg)]
.right-col50[![:scale 500](/figs/devbio/seqspace/validation_neighborhood_isometry_eg_gut.svg)]

---
count: false

# Regularize by constraining 'learned' homeomorphism

Constrain neighborhood distances in latent space to reproduce neighborhood distances from input.
$$ E(W, b) = \sum\_{i,a} \lambda\_{a} (x\_{ia} - y\_{ia})^2 + \sum\_{i} \sum\_{j \in N\_i} (D^{(x)}\_{ij} - D^{(z)}\_{ij})^2 $$

.left-col50[![:scale 500](/figs/devbio/seqspace/latent_weight_neighborhood_isometry_eg_gut.gif)]
.right-col50[![:scale 500](/figs/devbio/seqspace/reconstructed_neighborhood_isometry_eg_gut.gif)]

---
count: false

# Regularize by constraining 'learned' homeomorphism

![:emph](Important:) Depth of network controls the ability to "close" the manifold

.left-col50[![:scale 500](/figs/devbio/seqspace/reconstructed_deep_eg_gut.gif)]
.right-col50[![:scale 500](/figs/devbio/seqspace/gut_seam_close_up.png)]

---

# Learning the map for expression space

Use the same loss function and a similar architecture to learn the scRNAseq manifold.

.left-col50[![:scale 500](/figs/devbio/seqspace/scrna_loss.svg)]
.right-col50[![:scale 500](/figs/devbio/seqspace/scrna_genes_vs_reconstructed.png)]

--

#### Preliminary takeaways (still ongoing):

* Map data $10^4 \rightarrow 3$ coordinates per sequenced cell.
* Capture $83\%$ of variance

---

# Latent space representation for expression space

Colored by estimated AP/DV position (go to external pages)

---

# Next steps

* Run on variations of the underlying architecture to make sure we aren't too dependent on it.
* Extend to other developing systems
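
---

# Backup: sketch of the regularized loss

For reference, the regularized energy from the "Regularize by constraining 'learned' homeomorphism" slides can be evaluated as below. This is a hedged sketch, not the training code behind the results: the names `neighbor_indices` and `regularized_loss` are hypothetical, neighborhoods $N\_i$ are assumed to be Euclidean $k$-nearest neighbors in the input, and $D^{(x)}$, $D^{(z)}$ are assumed to be Euclidean distances within each neighborhood.

```python
import numpy as np


def neighbor_indices(x, k):
    """Indices of the k nearest neighbors N_i of each row of x (self excluded).

    Builds a dense (n, n, p) difference array, so this is for small n only.
    """
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]


def regularized_loss(x, y, z, lam, nbrs):
    """Weighted reconstruction error plus neighborhood isometry penalty.

    x, y : (n, p) input and reconstruction in the principal-component basis
    z    : (n, q) latent coordinates
    lam  : (p,)   per-component weights (principal values)
    nbrs : (n, k) neighbor indices, e.g. from neighbor_indices(x, k)
    """
    reconstruction = np.sum(lam[None, :] * (x - y) ** 2)
    # Distances from every point to each of its neighbors, shape (n, k)
    d_x = np.linalg.norm(x[:, None, :] - x[nbrs], axis=-1)
    d_z = np.linalg.norm(z[:, None, :] - z[nbrs], axis=-1)
    isometry = np.sum((d_x - d_z) ** 2)
    return reconstruction + isometry
```

In an actual training loop the same expression would be written in an automatic-differentiation framework so that gradients with respect to $W$ and $b$ are available; the NumPy version only evaluates the value of $E$.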