perturbLM is a lightweight package for assessing the effects of perturbations on single cell omics datasets. Here, we’ll demonstrate fitting a regularized linear model to a dataset measuring the single cell RNA-seq response to transcription factor overexpression in H1 stem cells.
First download the example dataset from figshare here. The study that generated this dataset is linked here. The dataset is stored as a Seurat object
Next load the necessary libraries
library(perturbLM)
library(Seurat)
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching SeuratObject
Attaching sp
library(Matrix)
library(glmnet)
Loaded glmnet 4.1-4
Read in the dataset and pull out the variable genes
obj <- readRDS("/share/ywu/perturbation-analysis/processed_data/parekh.rds")
var.genes <- intersect(obj@assays$RNA@var.features, rownames(obj))
obj
An object of class Seurat
33694 features across 3329 samples within 1 assay
Active assay: RNA (33694 features, 4000 variable features)
perturbLM offers a wrapper functionm RunLinearModel
, that 1) sets up the linear model inputs and outputs, 2) optimizes model hyperparameters, 3) runs the linear model, and 4) evaluates the linear model and assesses which perturbations are likely to have significant effects. RunLinearModel
automatically plots the hyperparameter optimization
model.results <- RunLinearModel(obj,
pert.col = "condition",
batch.col = "batch",
size.factor.col = "nCount_RNA",
mito.col = "percent.mt",
features.use = var.genes)
[1] "Assuming response is normalized expression data "
Loading required package: caret
Loading required package: ggplot2
Loading required package: lattice
Loading required package: snow
[1] "Best alpha: 1"
[1] "Best lambda: 0.05725640732971"
The RunLinearModel
wrapper function returns a list including the model design matrix (x), response (y) which in this case is the normalized expression matrix, the ElasticNet model (model) and coefficients (coefs), and the model evaluation
names(model.results)
[1] "x" "y" "perts" "train.cells" "test.cells" "family" "cv" "model" "coefs" "evaluation"
The model coefficients represent the effect of each perturbation on gene expression
print(dim(model.results$coefs))
[1] 4000 55
print(model.results$coefs[1:5, 1:5])
ASCL1 ASCL3 ASCL4 ASCL5 ATF7
ACTC1 -0.050788886 0.03277820 0.010251644 0 0
LINC00458 0.019186178 -0.03264142 -0.029962373 0 0
NTS 0.194006751 -0.04438682 0.033525631 0 0
ACTA1 -0.031709582 0.02361890 -0.035877048 0 0
HAND1 -0.002172323 0.00314431 -0.006350222 0 0
The evaluation dataframe summarizes which perturbations the model was able to effectively predict by comparing the full model with a reduced model that has no perturbation information.
head(model.results$evaluation)
The rel_performance
metric describes the relative performance of the full vs reduced model for each perturbation. In this case, it’s specifically the log2 transform of the ratio of the full model pearson correlation with the reduced model pearson correlation. You can modify the evaluation metric in RunLinearModel
by specifying the eval.metric
parameter. Perturbations with a high rel_performance
(>>1) are more likely to have a significant effect on expression.
We can pull out all the perturbations with rel_performance > 2
.
sig.perts <- subset(model.results$evaluation, rel_performance > 2)$group
sig.perts
[1] "CRX" "ASCL1" "SNAI2" "OTX2" "MYC" "MESP1" "ERG" "KLF4" "CDX2"
And find the top upregulated genes (by coefficient magnitude) for each perturbation
ngenes <- 4
top.sig.genes <- lapply(sig.perts, function(i) {
top.up <- head(names(sort(model.results$coefs[,i], decreasing = T)), n = ngenes)
# top.dn <- head(names(sort(model.results$coefs[,i], decreasing = F)), n = ngenes)
# c(top.up, top.dn)
top.up
})
top.sig.genes <- unique(unlist(top.sig.genes, F, F))
length(top.sig.genes)
[1] 33
We can create a new Seurat object with the model coefficients for visualization
pert.coefs <- model.results$coefs[,colnames(model.results$coefs) %in% levels(as.factor(obj$condition))]
coef.obj <- CreateSeuratObject(pert.coefs)
coef.obj <- ScaleData(coef.obj, verbose = F)
coef.obj$pert <- as.factor(colnames(coef.obj))
coef.obj
An object of class Seurat
4000 features across 51 samples within 1 assay
Active assay: RNA (4000 features, 0 variable features)
We can plot the effects of the top perturbations
coef.obj.top <- coef.obj[,sig.perts]
coef.obj.top$pert <- droplevels(coef.obj.top$pert)
levels(coef.obj.top$pert) <- sig.perts
DoHeatmap(coef.obj.top, features = top.sig.genes, group.by = "pert", size = 4, disp.max = 5, draw.lines = F) +
scale_fill_gradient2(low = "skyblue", mid = "white", high = "tomato", midpoint = 0) +
labs(fill='Scaled Coefficient')
Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
We can also embed the perturbations using PCA
coef.obj <- RunPCA(coef.obj, features = rownames(coef.obj))
Warning in irlba(A = t(x = object), nv = npcs, ...) :
You're computing too large a percentage of total singular values, use a standard svd instead.
PC_ 1
Positive: RP11-598D14.1, GPRC5A, S100A14, LINC01198, T, MAL, NLRP7, SLITRK2, LIM2, RP1-170O19.17
SYNE1, FGF17, RP11-402G3.3, AC020571.3, KRT23, ATP6V0D2, CH17-360D5.3, TLX3, CEACAM6, FXYD4
INSM2, WNT7A, RP11-189B4.7, SPANXN3, HSBP1L1, RP11-180O5.2, WNT11, ZFY-AS1, TNFSF9, HPGD
Negative: SOX2, TERF1, CSRP2, BUB3, CD24, MCM5, PTPRS, L1TD1, OLFM2, THY1
PSIP1, SLC1A5, ARG2, EEF1E1, GINS2, SMS, AP1S2, PIM2, TRIAP1, ACP1
SALL2, GAPDH, SEPHS1, AKR1B1, CDCA7L, FOXD3-AS1, PIF1, PAICS, TMEM126B, PHC1
PC_ 2
Positive: HOXC9, CRYM, HOXC8, HOXC-AS1, HOXA9, HOXD4, HOXB3, HOXA3, HOXC6, MR1
NKX2-8, CCL3, HOXC5, HOXC4, HOXB9, ALOXE3, MIR7-3HG, PYDC1, HOXB7, HOXA-AS3
HOXB-AS3, HOXB8, TGFBI, LGALS8, CA10, HRH2, GFRA2, HOXD1, HOXA7, LY6H
Negative: SYT1, DDX39A, BICD1, IQGAP2, TUNAR, RP11-490M8.1, TDGF1, TXNRD2, NBR1, SYTL1
C6orf48, POU5F1B, PIDD1, CDYL, SOCS1, GPN1, RACGAP1, NIFK, FOXP1, KIF18B
RP11-267L5.1, RAPGEF1, PDLIM1, SNHG12, CYP2S1, RPL12, TUBB2A, TRDN, HIST1H4E, VWA1
PC_ 3
Positive: CKB, CLDN7, CDH1, PPA1, CRB3, UACA, ECHDC2, COL18A1, APOBEC3C, C6orf132
NOTCH3, SVIP, MAL2, CYSTM1, PPP1R35, BAIAP2L2, WRAP73, TRAP1, DNAJB11, SLC9A3R1
C12orf75, ODC1, MAPK13, PSMA3, SLC7A1, SAPCD2, TFB1M, CAST, TMEM125, PGPEP1
Negative: RP11-494O16.4, PGM5P2, DKK2, IGF1, FABP1, CD22, CTB-180C19.1, RP11-703H8.7, HIST1H2BG, PTGS2
CTC-286N12.1, PLEKHO1, TDO2, GSC, VIM, CNTFR, EDNRB, FOXI2, ICAM2, FAM43B
ST8SIA4, CKS1B, SPARCL1, ZNF143, TINAGL1, RRAS, LINC01480, CACNG8, SUGCT, EFEMP1
PC_ 4
Positive: CUZD1, ID4, DLL1, ID2, TFF3, LFNG, NFASC, PCAT7, RP11-451G4.2, ID1
ANXA5, MYLPF, MIAT, LAMP5, MSX2, CCKBR, VIL1, SPIB, ABHD8, LOXL1
SHD, LHFP, NPTX1, LDHB, HES1, NEXN, SHISA2, RBP1, RP11-180C16.1, METRNL
Negative: TCEA1, NDUFAB1, LITAF, RTP1, EIF4A2, SDHB, PLPP1, RNF114, RP11-132A1.3, TXN
CALM1, RP11-69I8.2, HIST1H2AC, RUNX1T1, NSUN5, VSIG10, TRIM35, MT1X, HIST1H4H, ATG3
EIF6, TMEM266, MRPL46, ZNF697, RP11-457D13.4, RP11-297M9.1, RGS10, CBR3, LINC01508, TFRC
PC_ 5
Positive: FGF5, KRTAP3-1, GUCY1A3, EDN3, EGR3, SPRR2A, UROD, ZNF804A, SPOCK2, APLN
CCND1, RP11-96B5.3, ERVMER61-1, DIO2, STMND1, S100A1, CD44, PLK2, SYTL2, ZRSR2
CTA-929C8.6, FLJ22447, HP09025, RIC8A, LINC00664, AC002480.3, AC009014.3, CHRNA1, ACTC1, ELL2
Negative: ETV5, BAALC, AURKB, TCOF1, DLX2, PSMC2, CKAP2L, ANGPTL4, HMX2, FBXO5
SPTSSA, ASIC4, UBE2L3, ZNF91, SARS, RASGRF1, WNK4, IDH3B, CASC5, PPP1R1B
NOTO, SSTR1, LINC01612, SLC16A1-AS1, LMO7, NRXN1, BRK1, WNT4, RP11-167B3.2, BARD1
PCAPlot(coef.obj, group.by = "pert", label = T) +
theme(legend.position = "none")
We can also use a nonlinear embedding like UMAP
coef.obj <- RunUMAP(coef.obj, dims = 1:15)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
15:37:36 UMAP embedding parameters a = 0.9922 b = 1.112
15:37:36 Read 51 rows and found 15 numeric columns
15:37:36 Using Annoy for neighbor search, n_neighbors = 30
15:37:36 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:37:36 Writing NN index file to temp file /tmp/RtmpYNQjl3/file23962d60d972
15:37:36 Searching Annoy index using 1 thread, search_k = 3000
15:37:36 Annoy recall = 100%
15:37:37 Commencing smooth kNN distance calibration using 1 thread
15:37:37 23 smooth knn distance failures
15:37:37 Initializing from normalized Laplacian + noise
15:37:37 Commencing optimization for 500 epochs, with 1980 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:37:38 Optimization finished
DimPlot(coef.obj, reduction = "umap", label = T, group.by = "pert") +
theme(legend.position = "none")