perturbLM is a lightweight package for assessing the effects of perturbations on single cell omics datasets. Here, we’ll demonstrate fitting a regularized linear model to a dataset measuring the single cell RNA-seq response to transcription factor overexpression in H1 stem cells.
First download the example dataset from figshare here. The study that generated this dataset is linked here. The dataset is stored as a Seurat object
Next load the necessary libraries
library(perturbLM)
library(Seurat)Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
Attaching SeuratObject
Attaching splibrary(Matrix)
library(glmnet)Loaded glmnet 4.1-4Read in the dataset and pull out the variable genes
obj <- readRDS("/share/ywu/perturbation-analysis/processed_data/parekh.rds")
var.genes <- intersect(obj@assays$RNA@var.features, rownames(obj))
objAn object of class Seurat 
33694 features across 3329 samples within 1 assay 
Active assay: RNA (33694 features, 4000 variable features)perturbLM offers a wrapper functionm RunLinearModel, that 1) sets up the linear model inputs and outputs, 2) optimizes model hyperparameters, 3) runs the linear model, and 4) evaluates the linear model and assesses which perturbations are likely to have significant effects. RunLinearModel automatically plots the hyperparameter optimization
model.results <- RunLinearModel(obj,
                                pert.col = "condition",
                                batch.col = "batch",
                                size.factor.col = "nCount_RNA",
                                mito.col = "percent.mt",
                                features.use = var.genes)[1] "Assuming response is normalized expression data "Loading required package: caret
Loading required package: ggplot2
Loading required package: lattice
Loading required package: snow[1] "Best alpha: 1"
[1] "Best lambda: 0.05725640732971"The RunLinearModel wrapper function returns a list including the model design matrix (x), response (y) which in this case is the normalized expression matrix, the ElasticNet model (model) and coefficients (coefs), and the model evaluation
names(model.results) [1] "x"           "y"           "perts"       "train.cells" "test.cells"  "family"      "cv"          "model"       "coefs"       "evaluation" The model coefficients represent the effect of each perturbation on gene expression
print(dim(model.results$coefs))[1] 4000   55print(model.results$coefs[1:5, 1:5])                 ASCL1       ASCL3        ASCL4 ASCL5 ATF7
ACTC1     -0.050788886  0.03277820  0.010251644     0    0
LINC00458  0.019186178 -0.03264142 -0.029962373     0    0
NTS        0.194006751 -0.04438682  0.033525631     0    0
ACTA1     -0.031709582  0.02361890 -0.035877048     0    0
HAND1     -0.002172323  0.00314431 -0.006350222     0    0The evaluation dataframe summarizes which perturbations the model was able to effectively predict by comparing the full model with a reduced model that has no perturbation information.
head(model.results$evaluation)The rel_performance metric describes the relative performance of the full vs reduced model for each perturbation. In this case, it’s specifically the log2 transform of the ratio of the full model pearson correlation with the reduced model pearson correlation. You can modify the evaluation metric in RunLinearModel by specifying the eval.metric parameter. Perturbations with a high rel_performance (>>1) are more likely to have a significant effect on expression.
We can pull out all the perturbations with rel_performance > 2.
sig.perts <- subset(model.results$evaluation, rel_performance > 2)$group
sig.perts[1] "CRX"   "ASCL1" "SNAI2" "OTX2"  "MYC"   "MESP1" "ERG"   "KLF4"  "CDX2" And find the top upregulated genes (by coefficient magnitude) for each perturbation
ngenes <- 4
top.sig.genes <- lapply(sig.perts, function(i) {
  top.up <- head(names(sort(model.results$coefs[,i], decreasing = T)), n = ngenes)
  # top.dn <- head(names(sort(model.results$coefs[,i], decreasing = F)), n = ngenes)
  # c(top.up, top.dn)
  top.up
})
top.sig.genes <- unique(unlist(top.sig.genes, F, F))
length(top.sig.genes)[1] 33We can create a new Seurat object with the model coefficients for visualization
pert.coefs <- model.results$coefs[,colnames(model.results$coefs) %in% levels(as.factor(obj$condition))]
coef.obj <- CreateSeuratObject(pert.coefs)
coef.obj <- ScaleData(coef.obj, verbose = F)
coef.obj$pert <- as.factor(colnames(coef.obj))
coef.objAn object of class Seurat 
4000 features across 51 samples within 1 assay 
Active assay: RNA (4000 features, 0 variable features)We can plot the effects of the top perturbations
coef.obj.top <- coef.obj[,sig.perts]
coef.obj.top$pert <- droplevels(coef.obj.top$pert)
levels(coef.obj.top$pert) <- sig.perts
DoHeatmap(coef.obj.top, features = top.sig.genes, group.by = "pert", size = 4, disp.max = 5, draw.lines = F) + 
  scale_fill_gradient2(low = "skyblue", mid = "white", high = "tomato", midpoint = 0) + 
  labs(fill='Scaled Coefficient') Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.We can also embed the perturbations using PCA
coef.obj <- RunPCA(coef.obj, features = rownames(coef.obj))Warning in irlba(A = t(x = object), nv = npcs, ...) :
  You're computing too large a percentage of total singular values, use a standard svd instead.
PC_ 1 
Positive:  RP11-598D14.1, GPRC5A, S100A14, LINC01198, T, MAL, NLRP7, SLITRK2, LIM2, RP1-170O19.17 
       SYNE1, FGF17, RP11-402G3.3, AC020571.3, KRT23, ATP6V0D2, CH17-360D5.3, TLX3, CEACAM6, FXYD4 
       INSM2, WNT7A, RP11-189B4.7, SPANXN3, HSBP1L1, RP11-180O5.2, WNT11, ZFY-AS1, TNFSF9, HPGD 
Negative:  SOX2, TERF1, CSRP2, BUB3, CD24, MCM5, PTPRS, L1TD1, OLFM2, THY1 
       PSIP1, SLC1A5, ARG2, EEF1E1, GINS2, SMS, AP1S2, PIM2, TRIAP1, ACP1 
       SALL2, GAPDH, SEPHS1, AKR1B1, CDCA7L, FOXD3-AS1, PIF1, PAICS, TMEM126B, PHC1 
PC_ 2 
Positive:  HOXC9, CRYM, HOXC8, HOXC-AS1, HOXA9, HOXD4, HOXB3, HOXA3, HOXC6, MR1 
       NKX2-8, CCL3, HOXC5, HOXC4, HOXB9, ALOXE3, MIR7-3HG, PYDC1, HOXB7, HOXA-AS3 
       HOXB-AS3, HOXB8, TGFBI, LGALS8, CA10, HRH2, GFRA2, HOXD1, HOXA7, LY6H 
Negative:  SYT1, DDX39A, BICD1, IQGAP2, TUNAR, RP11-490M8.1, TDGF1, TXNRD2, NBR1, SYTL1 
       C6orf48, POU5F1B, PIDD1, CDYL, SOCS1, GPN1, RACGAP1, NIFK, FOXP1, KIF18B 
       RP11-267L5.1, RAPGEF1, PDLIM1, SNHG12, CYP2S1, RPL12, TUBB2A, TRDN, HIST1H4E, VWA1 
PC_ 3 
Positive:  CKB, CLDN7, CDH1, PPA1, CRB3, UACA, ECHDC2, COL18A1, APOBEC3C, C6orf132 
       NOTCH3, SVIP, MAL2, CYSTM1, PPP1R35, BAIAP2L2, WRAP73, TRAP1, DNAJB11, SLC9A3R1 
       C12orf75, ODC1, MAPK13, PSMA3, SLC7A1, SAPCD2, TFB1M, CAST, TMEM125, PGPEP1 
Negative:  RP11-494O16.4, PGM5P2, DKK2, IGF1, FABP1, CD22, CTB-180C19.1, RP11-703H8.7, HIST1H2BG, PTGS2 
       CTC-286N12.1, PLEKHO1, TDO2, GSC, VIM, CNTFR, EDNRB, FOXI2, ICAM2, FAM43B 
       ST8SIA4, CKS1B, SPARCL1, ZNF143, TINAGL1, RRAS, LINC01480, CACNG8, SUGCT, EFEMP1 
PC_ 4 
Positive:  CUZD1, ID4, DLL1, ID2, TFF3, LFNG, NFASC, PCAT7, RP11-451G4.2, ID1 
       ANXA5, MYLPF, MIAT, LAMP5, MSX2, CCKBR, VIL1, SPIB, ABHD8, LOXL1 
       SHD, LHFP, NPTX1, LDHB, HES1, NEXN, SHISA2, RBP1, RP11-180C16.1, METRNL 
Negative:  TCEA1, NDUFAB1, LITAF, RTP1, EIF4A2, SDHB, PLPP1, RNF114, RP11-132A1.3, TXN 
       CALM1, RP11-69I8.2, HIST1H2AC, RUNX1T1, NSUN5, VSIG10, TRIM35, MT1X, HIST1H4H, ATG3 
       EIF6, TMEM266, MRPL46, ZNF697, RP11-457D13.4, RP11-297M9.1, RGS10, CBR3, LINC01508, TFRC 
PC_ 5 
Positive:  FGF5, KRTAP3-1, GUCY1A3, EDN3, EGR3, SPRR2A, UROD, ZNF804A, SPOCK2, APLN 
       CCND1, RP11-96B5.3, ERVMER61-1, DIO2, STMND1, S100A1, CD44, PLK2, SYTL2, ZRSR2 
       CTA-929C8.6, FLJ22447, HP09025, RIC8A, LINC00664, AC002480.3, AC009014.3, CHRNA1, ACTC1, ELL2 
Negative:  ETV5, BAALC, AURKB, TCOF1, DLX2, PSMC2, CKAP2L, ANGPTL4, HMX2, FBXO5 
       SPTSSA, ASIC4, UBE2L3, ZNF91, SARS, RASGRF1, WNK4, IDH3B, CASC5, PPP1R1B 
       NOTO, SSTR1, LINC01612, SLC16A1-AS1, LMO7, NRXN1, BRK1, WNT4, RP11-167B3.2, BARD1 PCAPlot(coef.obj, group.by = "pert", label = T) + 
  theme(legend.position = "none")We can also use a nonlinear embedding like UMAP
coef.obj <- RunUMAP(coef.obj, dims = 1:15)Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
15:37:36 UMAP embedding parameters a = 0.9922 b = 1.112
15:37:36 Read 51 rows and found 15 numeric columns
15:37:36 Using Annoy for neighbor search, n_neighbors = 30
15:37:36 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:37:36 Writing NN index file to temp file /tmp/RtmpYNQjl3/file23962d60d972
15:37:36 Searching Annoy index using 1 thread, search_k = 3000
15:37:36 Annoy recall = 100%
15:37:37 Commencing smooth kNN distance calibration using 1 thread
15:37:37 23 smooth knn distance failures
15:37:37 Initializing from normalized Laplacian + noise
15:37:37 Commencing optimization for 500 epochs, with 1980 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:37:38 Optimization finishedDimPlot(coef.obj, reduction = "umap", label = T, group.by = "pert") + 
  theme(legend.position = "none")