This is a quick walkthrough demonstrating how to generate SWNE plots alongside the Seurat pipeline using a 3k PBMC dataset as an example.

To save time we will be using the pre-computed Seurat object pbmc3k_seurat.Robj, which can be downloaded here.

First let’s load the required libraries

library(Seurat)
library(swne)

Next let’s load the Seurat object

obj <- readRDS("~/swne/Data/pbmc3k_final.RObj")

Most scRNA-seq pipelines only use a subset of highly overdispersed genes for analysis. We’ll pull out those variable genes here, as well as the cluster labels.

## Pull out overdispersed genes as defined by Seurat
var.genes <- VariableFeatures(obj)
length(var.genes)
## [1] 2000
## Pull out cell clusters as defined by Seurat
cell.clusters <- Idents(obj)
levels(cell.clusters)
## [1] "Naive CD4 T"  "Memory CD4 T" "CD14+ Mono"   "B"           
## [5] "CD8 T"        "FCGR3A+ Mono" "NK"           "DC"          
## [9] "Mk"

The easiest way to generate an SWNE embedding is to use the wrapper function RunSWNE

## Run SWNE
genes.embed <- c("MS4A1", "GNLY", "CD3E", "CD14",
                 "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")
swne.embedding <- RunSWNE(obj, k = 16, genes.embed = genes.embed, sample.groups = cell.clusters)
## [1] "2000 variable genes to use"
## Initial stress        : 0.12346
## stress after  10 iters: 0.03836, magic = 0.500
## stress after  20 iters: 0.03631, magic = 0.500
## stress after  30 iters: 0.03571, magic = 0.500
## stress after  40 iters: 0.03570, magic = 0.500
## Plot SWNE
PlotSWNE(swne.embedding, alpha.plot = 0.4, sample.groups = cell.clusters,
         do.label = T, label.size = 3.5, pt.size = 1, show.legend = F,
         seed = 42)

RunSWNE can also return a Seurat object with SWNE set as a dimensional reduction

## Run SWNE
obj <- RunSWNE(obj, k = 16, genes.embed = genes.embed,
               return.format = "seurat")
## [1] "2000 variable genes to use"
## Initial stress        : 0.12403
## stress after  10 iters: 0.05757, magic = 0.092
## stress after  20 iters: 0.03781, magic = 0.500
## stress after  30 iters: 0.03656, magic = 0.500
## stress after  40 iters: 0.03645, magic = 0.500
## stress after  50 iters: 0.03644, magic = 0.500
DimPlot(obj, reduction = "swne")

Now we’ll go through the SWNE embedding process step by step to get a sense of how everything works

First, let’s pull out the counts, scale and adjust gene variance, while keeping the scaled matrix nonnegative.

norm.counts <- ExtractNormCounts(obj, obj.type = "seurat", rescale = T, rescale.method = "log")
## calculating variance fit ... using gam
## Warning in pf(exp(df$res), n.obs, n.obs, lower.tail = F, log.p = F): NaNs
## produced
dim(norm.counts)
## [1] 13714  2638

We use the FindNumFactors function to identify the optimal number of factors to use. This function can be slow for large datasets, since it iterates over different values of k, so a simple “hack” is to just set k equal to the number of significant principal components.

k.range <- seq(2,16,2) ## Range of factors to iterate over
k.err <- FindNumFactors(norm.counts[var.genes,], k.range = k.range, n.cores = 8, do.plot = T)

We then run the NMF decomposition. We can initialize the NMF using either Independent Component Analysis (ica), Nonnegative SVD (nnsvd), or a completely random initialization. ICA is recommended for most datasets. The output of RunNMF is a list of the gene loadings (W) and NMF embedding (H).

k <- 16
nmf.res <- RunNMF(norm.counts[var.genes,], k = k)

We can either use the pre-computed Shared Nearest Neighbors (SNN) matrix from Seurat or re-compute it ourselves.

# pc.scores <- t(GetCellEmbeddings(se.obj, reduction.type = "pca", dims.use = 1:k))
# snn <- CalcSNN(pc.scores)
obj <- FindNeighbors(obj, k = 10, prune.SNN = 1/15)
## Computing nearest neighbor graph
## Computing SNN
snn <- as(obj@graphs$RNA_snn, "dgCMatrix")

We can prune the SNN matrix by removing edges between cells in clusters that don’t share a statistically significant amount of edges using a PAGA graph (https://github.com/theislab/paga)

knn <- as(obj@graphs$RNA_nn, "dgCMatrix") ## Extract kNN matrix
snn <- PruneSNN(snn, knn, clusters = cell.clusters, qval.cutoff = 1e-3)

Run the SWNE embedding. The three key parameters are alpha.exp, snn.exp, and n_pull, which control how the factors and neighboring cells affect the cell coordinates.

alpha.exp <- 1.25 # Increase this > 1.0 to move the cells closer to the factors. Values > 2 start to distort the data.
snn.exp <- 0.25 # Lower this < 1.0 to move similar cells closer to each other
n_pull <- 3 # The number of factors pulling on each cell. Must be at least 3.
swne.embedding <- EmbedSWNE(nmf.res$H, SNN = snn, alpha.exp = alpha.exp, snn.exp = snn.exp,
                            n_pull = n_pull)
## Initial stress        : 0.15145
## stress after  10 iters: 0.04650, magic = 0.500
## stress after  20 iters: 0.04597, magic = 0.500
## stress after  30 iters: 0.04594, magic = 0.500

For now, let’s hide the factors by setting their names to the empty string "". We’ll interpret them later

swne.embedding$H.coords$name <- ""

To help with interpreting these cell clusters, let’s pick some key PBMC genes to embed.

genes.embed <- c("MS4A1", "GNLY", "CD3E", "CD14",
                 "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")

Since we only ran NMF on the overdispersed genes, we need to project the rest of the genes onto the NMF projection to get gene loadings for all genes.

nmf.res$W <- ProjectFeatures(norm.counts, nmf.res$H, n.cores = 8)

Now we can embed the key PBMC genes onto the visualization and remake the plot

swne.embedding <- EmbedFeatures(swne.embedding, nmf.res$W, genes.embed, n_pull = n_pull)

Let’s make the SWNE plot with the key genes embedded. The closer a cell or a cluster is to a gene, the higher the expression level. We set a seed for reproducible cluster colors, so that every plot will use the same colors to label the clusters.

color.seed <- 42
PlotSWNE(swne.embedding, alpha.plot = 0.4, sample.groups = cell.clusters, do.label = T,
         label.size = 3.5, pt.size = 1, show.legend = F, seed = color.seed)

We can validate the embedded genes by overlaying the expression of one of these key genes onto the plot.

gene.use <- "CD8A"
gene.expr <- norm.counts[gene.use,]
FeaturePlotSWNE(swne.embedding, gene.expr, gene.use, alpha.plot = 0.4, label.size = 3.5, pt.size = 1.25)

We can also make a t-SNE plot for comparison.

obj <- RunTSNE(obj)
tsne.scores <- Embeddings(obj, "tsne")
PlotDims(tsne.scores, sample.groups = cell.clusters, pt.size = 1, label.size = 3.5, alpha = 0.4,
         show.legend = F, seed = color.seed, show.axes = F)

We can also interpret the factors by using the gene loadings matrix. Here, we extract the top 3 genes for each factor by gene loading. Since NMF tends to create a parts-based representation of the data, the factors often correspond to key biological processes or gene modules that explain the data.

gene.loadings <- nmf.res$W
top.factor.genes.df <- SummarizeAssocFeatures(gene.loadings, features.return = 3)
head(top.factor.genes.df)
##   assoc_score feature   factor
## 1   0.7571037     LTB factor_1
## 2   0.5724399    FTH1 factor_1
## 3   0.5584401    IL32 factor_1
## 4   0.5369279    CD74 factor_2
## 5   0.4849361 HLA-DRA factor_2
## 6   0.4363460    FTH1 factor_2

And finally, we can make a heatmap to visualize the top factors for each gene

gene.loadings.heat <- gene.loadings[unique(top.factor.genes.df$feature),]
ggHeat(gene.loadings.heat, clustering = "col")

Finally, we can extract cluster colors for compatibility with other plotting methods (i.e. Monocle)

color.mapping <- ExtractSWNEColors(swne.embedding, sample.groups = cell.clusters, seed = color.seed)
color.mapping
## Memory CD4 T            B   CD14+ Mono           NK        CD8 T 
##    "#00C19F"    "#00B9E3"    "#DB72FB"    "#00BA38"    "#D39200" 
##  Naive CD4 T FCGR3A+ Mono           DC           Mk 
##    "#F8766D"    "#FF61C3"    "#619CFF"    "#93AA00"