Single cell and bulk chromatin accessibility datasets have enabled scientists to study the noncoding regulatory genome with unprecedented resolution. However annotating the biological functions of these regions remains a challenge. Here we demonstrate a computational pipeline to building TF regulatory networks centered around specific noncoding regions, thus adding key biological context. We demonstrate building TF regulatory networks centered around Human Accelerated Regions (HARs), where each TF regulates downstream genes via binding to one or more HARs.
First we need to download the HARs, the single cell chromatin accessibility and gene expression datasets, and a Hi-C dataset generated from sorted Radial Glia cells. The single cell datasets are downsampled from the cerebrum datasets from the chromatin and RNA fetal human cell atlases, and the Hi-C dataset comes from a study of chromatin interaction from sorted fetal cortex cell types.
We start the analysis by installing and/or loading the required libraries. Uncomment the installation code to install packages.
# BiocManager::install()
# BiocManager::install(c("GenomicRanges", "ChIPseeker", "org.Hs.eg.db", "TxDb.Hsapiens.UCSC.hg38.knownGene", "motifmatchr", "chromVAR", "SummarizedExperiment", "BSgenome.Hsapiens.UCSC.hg38"))
# install.packages("igraph")
# devtools::install_github("yanwu2014/chromfunks")
# devtools::install_github("yanwu2014/swne")
library(chromfunks)
library(swne)
library(chromVAR)
library(chromVARmotifs)
library(SummarizedExperiment)
library(motifmatchr)
library(GenomicRanges)
library(ChIPseeker)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)
library(igraph)
Here we load the chromatin and gene expression single cell datasets, filtering for chromatin peaks that are detected in at least 25 cells.
## Load accessibility data
load("../Data/atac_counts_hg38_downsampled.RData")
dim(atac.sampled.counts)
## [1] 1048390 36158
## Load RNA data
load("../Data/atac_rna_downsampled.RData")
dim(atac.rna.counts)
## [1] 29187 31519
## Filter peaks
min.cells <- 25
atac.sampled.counts <- atac.sampled.counts[rowSums(atac.sampled.counts) > min.cells,]
atac.peaks <- atac.peaks[rownames(atac.sampled.counts)]
dim(atac.sampled.counts)
## [1] 599640 36158
We also need to load and format the Hi-C data and the Human Accelerated Regions (HARs)
## Load Hi-C connections
RG.conns <- read.table("../Data/RG.MAPS.peaks.txt", sep = "\t",
header = T, stringsAsFactors = F)
RG.conns$Peak1 <- paste0(RG.conns$chr1, ":", RG.conns$start1, "-", RG.conns$end1)
RG.conns$Peak2 <- paste0(RG.conns$chr2, ":", RG.conns$start2, "-", RG.conns$end2)
RG.conns <- RG.conns[,c("Peak1", "Peak2")]
head(RG.conns)
## Peak1 Peak2
## 1 chr1:1135000-1140000 chr1:1435000-1440000
## 2 chr1:8420000-8425000 chr1:8700000-8705000
## 3 chr1:9940000-9945000 chr1:10205000-10210000
## 4 chr1:9940000-9945000 chr1:10395000-10400000
## 5 chr1:28260000-28265000 chr1:28645000-28650000
## 6 chr1:34985000-34990000 chr1:35930000-35935000
## Load HARs
har.regions <- readPeakFile("../Data/HAR_regions_hg38.bed")
names(har.regions) <- granges2peak(har.regions)
head(har.regions)
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | V4
## <Rle> <IRanges> <Rle> | <factor>
## chr1:7999973-8000320 chr1 7999973-8000320 * | Bird_HAR165
## chr1:19361591-19361913 chr1 19361591-19361913 * | Toh_HAR234
## chr1:20386867-20387046 chr1 20386867-20387046 * | Bird_HAR452
## chr1:25561377-25561394 chr1 25561377-25561394 * | Toh_HAR355
## chr1:26983126-26983454 chr1 26983126-26983454 * | Bird_HAR414
## chr1:37979601-37979949 chr1 37979601-37979949 * | Bird_HAR159
## V5
## <factor>
## chr1:7999973-8000320 ERRFI1
## chr1:19361591-19361913 NBL1
## chr1:20386867-20387046 DDOST,MUL1,PINK1,LOC339505
## chr1:25561377-25561394 MAN1C1,FAM54B,LDLRAP1,SEPN1
## chr1:26983126-26983454 TRNP1,FAM46B,SLC9A1
## chr1:37979601-37979949 INPP5B
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
Now that we have all the datasets loaded and formatted, we can start making the networks. First we'll link TFs to HARs using the presence of TF motifs. The TFRegionLinks function also computes a correlation between the overall TF motif activity (as quantified using chromVAR) and the accessibility of HARs using the single cell accessibility dataset. The correlations are computed using binned accessibility counts to improve correlations, similar to how Cicero computes co-accessibility, with the bins computed using nearest neighbors in the UMAP space (although feel free to use any type of dimensional reduction here).
## Link ATAC TFs to regions using motifs and correlations
register(MulticoreParam(8))
tf.peak.df <- TFRegionLinks(atac.sampled.counts, atac.umap.emb[colnames(atac.sampled.counts),],
har.regions, n.cores = 8)
head(tf.peak.df)
## TF region cor
## 1 NFIL3 chr1:60983464-60985043 0.14602608
## 2 NFIL3 chr1:87252413-87253598 0.06789432
## 3 NFIL3 chr1:176693610-176694338 0.05402977
## 4 NFIL3 chr1:209457389-209457747 0.01851922
## 5 NFIL3 chr1:210261952-210263304 0.13349384
## 6 NFIL3 chr2:36388849-36390064 0.10310539
We can then filter links between TFs and HARs using a correlation cutoff. The idea here is that the presence of a TF motif in a genomic region doesn't necessarily mean the TF is bound to that region in a given cell type. By ensuring the overall activity of the TF correlates with the accessibility of the region, we at least know that the TF and the region are active in the same cell types.
## Filter TF - region links using a correlation cutoff
hist(tf.peak.df$cor)
quantile(tf.peak.df$cor, na.rm = T)
## 0% 25% 50% 75% 100%
## -0.34110081 0.01346722 0.05271586 0.09826845 0.67973153
nrow(tf.peak.df)
## [1] 46572
min.cor <- 0.1
tf.peak.df <- subset(tf.peak.df, cor > min.cor)
nrow(tf.peak.df)
## [1] 11321
Next we link HARs to downstream genes using the Hi-C dataset. We link an HAR to a gene if the HAR is in contact with a peak within 3 kb of the gene's transcription start site.
## Link ATAC HAR/HGE peaks to genes
har.peaks.ix <- findOverlaps(har.regions, atac.peaks)
har.peaks <- atac.peaks[unique(har.peaks.ix@to)]
peak.gene.df <- RegionGeneLinks(har.peaks, RG.conns, link.promoter = T,
promoter.region = c(-3e3, 3e3),
region.name = NULL, weight.col = NULL)
## >> preparing features information... 2021-01-29 11:00:48 PM
## >> identifying nearest features... 2021-01-29 11:00:49 PM
## >> calculating distance from peak to TSS... 2021-01-29 11:00:50 PM
## >> assigning genomic annotation... 2021-01-29 11:00:50 PM
## >> adding gene annotation... 2021-01-29 11:01:14 PM
## >> assigning chromosome lengths 2021-01-29 11:01:14 PM
## >> done... 2021-01-29 11:01:14 PM
## >> preparing features information... 2021-01-29 11:01:16 PM
## >> identifying nearest features... 2021-01-29 11:01:16 PM
## >> calculating distance from peak to TSS... 2021-01-29 11:01:17 PM
## >> assigning genomic annotation... 2021-01-29 11:01:17 PM
## >> adding gene annotation... 2021-01-29 11:01:22 PM
## >> assigning chromosome lengths 2021-01-29 11:01:22 PM
## >> done... 2021-01-29 11:01:22 PM
head(peak.gene.df)
## region gene weight
## 1 chr1:20385443-20387388 CAMK2N1 1
## 2 chr1:44019177-44020052 SLC6A9 1
## 3 chr1:44631310-44633056 RNF220 1
## 4 chr1:53325625-53328681 LINC01771 1
## 5 chr1:204931301-204931506 NFASC 1
## 6 chr10:24493883-24494528 KIAA1217 1
We can take advantage of our single cell datasets to ensure that the TFs are expressed in the relevant cell types. Since our Hi-C network was generated from Radial Glia (RG), we averaged across cell types to identify TFs that are expressed in Radial Glia.
## Find TFs expressed in Radial Glia
atac.rna.avg.counts <- GetClusterAvg(atac.rna.counts, atac.rna.ident, binarize = T)
hist(atac.rna.avg.counts)
quantile(atac.rna.avg.counts)
## 0% 25% 50% 75% 100%
## 0.0000000 0.0002000 0.0022000 0.0172000 0.9694999
min.rna.cl.frac <- 0.025
rg.expr <- atac.rna.avg.counts[,"RG"]
rg.genes <- names(rg.expr[rg.expr > min.rna.cl.frac])
We also want to ensure that the HARs that link TFs to genes in our network are accessible in Radial Glia.
## Find peaks accessible in Radial Glia
atac.avg.counts <- GetClusterAvg(LogTFIDF(atac.sampled.counts), atac.sampled.ident)
hist(atac.avg.counts)
quantile(atac.avg.counts)
## 0% 25% 50% 75% 100%
## 0.000000000 0.004599217 0.008348377 0.015376955 0.692295226
min.cl.access <- 0.01
rg.access <- atac.avg.counts[,"RG"]
rg.peaks <- names(rg.access[rg.access > min.cl.access])
We can then generate a TF to gene network where the TFs are expressed in Radial Glia and the HARs that link TFs to genes are accessible in Radial Glia. The SubsetLinks function returns a list of networks as dataframes, one for linking TFs to HARs, one for linking HARs to genes, and one for linking TFs to genes.
## Subset to HARs accessible in RG and TFs expressed in RG
rg.networks <- SubsetLinks(tf.peak.df, peak.gene.df, regions = rg.peaks, tfs = rg.genes)
sapply(rg.networks, nrow)
## TF_region_network Region_gene_network TF_gene_network
## 2086 168 668
head(rg.networks$TF_gene_network)
## TF gene
## 1 IRF2 C11orf58
## 2 IRF2 SOX6
## 3 IRF2 MEIS2
## 4 IRF2 TBC1D16
## 5 IRF2 IGFBP2
## 6 IRF2 BBX
Finally we can plot our HAR-centered TF regulatory network that is also specific to Radial Glia cells. The red nodes indicate TFs and the smaller blue nodes are genes.
## Plot RG network
PlotNetwork(rg.networks$TF_gene_network)
As you can see the network is quite large and interconnected, so let's try and identify some interesting biology to zoom in on. First, we want to find the TFs and genes that have the most influence on the network. For this, we can use the eigenvector centrality metric computed using the igraph package. Since we also want to look for genes that are highly interconnected as well, we choose to compute the undirected eigenvector centrality.
rg.graph <- graph_from_data_frame(rg.networks$TF_gene_network)
node.centrality <- eigen_centrality(rg.graph, directed = F)
node.centrality <- sort(node.centrality$vector, decreasing = T)
ggBarplot(head(node.centrality, n = 10))
We can see that ZFHX4, zinc finger protein linked to intellectual disability and speech development is highly central in our Radial Glia network. Thus we zoom into a subnetwork of all genes within 2 degrees of separation of ZFHX4 to get a better sense of the specific genes/TFs involved.
tf.gene.df <- rg.networks$TF_gene_network
ZFHX4.neighbors <- subset(tf.gene.df, TF == "ZFHX4" | gene == "ZFHX4")
ZFHX4.network.df <- subset(tf.gene.df, gene %in% ZFHX4.neighbors$gene | TF %in% ZFHX4.neighbors$TF)
PlotNetwork(ZFHX4.network.df, plot.title = "ZFHX4 Subnetwork", label = T)