Building Region-Centered TF Networks

Single cell and bulk chromatin accessibility datasets have enabled scientists to study the noncoding regulatory genome with unprecedented resolution. However annotating the biological functions of these regions remains a challenge. Here we demonstrate a computational pipeline to building TF regulatory networks centered around specific noncoding regions, thus adding key biological context. We demonstrate building TF regulatory networks centered around Human Accelerated Regions (HARs), where each TF regulates downstream genes via binding to one or more HARs.

First we need to download the HARs, the single cell chromatin accessibility and gene expression datasets, and a Hi-C dataset generated from sorted Radial Glia cells. The single cell datasets are downsampled from the cerebrum datasets from the chromatin and RNA fetal human cell atlases, and the Hi-C dataset comes from a study of chromatin interaction from sorted fetal cortex cell types.

We start the analysis by installing and/or loading the required libraries. Uncomment the installation code to install packages.

# BiocManager::install()
# BiocManager::install(c("GenomicRanges", "ChIPseeker", "org.Hs.eg.db", "TxDb.Hsapiens.UCSC.hg38.knownGene", "motifmatchr", "chromVAR", "SummarizedExperiment", "BSgenome.Hsapiens.UCSC.hg38"))
# install.packages("igraph")
# devtools::install_github("yanwu2014/chromfunks")
# devtools::install_github("yanwu2014/swne")

library(chromfunks)
library(swne)
library(chromVAR)
library(chromVARmotifs)
library(SummarizedExperiment)
library(motifmatchr)
library(GenomicRanges)
library(ChIPseeker)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)
library(igraph)

Here we load the chromatin and gene expression single cell datasets, filtering for chromatin peaks that are detected in at least 25 cells.

## Load accessibility data
load("../Data/atac_counts_hg38_downsampled.RData")
dim(atac.sampled.counts)

## [1] 1048390   36158

## Load RNA data
load("../Data/atac_rna_downsampled.RData")
dim(atac.rna.counts)

## [1] 29187 31519

## Filter peaks
min.cells <- 25
atac.sampled.counts <- atac.sampled.counts[rowSums(atac.sampled.counts) > min.cells,]
atac.peaks <- atac.peaks[rownames(atac.sampled.counts)]
dim(atac.sampled.counts)

## [1] 599640  36158

We also need to load and format the Hi-C data and the Human Accelerated Regions (HARs)

## Load Hi-C connections
RG.conns <- read.table("../Data/RG.MAPS.peaks.txt", sep = "\t",
                       header = T, stringsAsFactors = F)
RG.conns$Peak1 <- paste0(RG.conns$chr1, ":", RG.conns$start1, "-", RG.conns$end1)
RG.conns$Peak2 <- paste0(RG.conns$chr2, ":", RG.conns$start2, "-", RG.conns$end2)
RG.conns <- RG.conns[,c("Peak1", "Peak2")]
head(RG.conns)

##                    Peak1                  Peak2
## 1   chr1:1135000-1140000   chr1:1435000-1440000
## 2   chr1:8420000-8425000   chr1:8700000-8705000
## 3   chr1:9940000-9945000 chr1:10205000-10210000
## 4   chr1:9940000-9945000 chr1:10395000-10400000
## 5 chr1:28260000-28265000 chr1:28645000-28650000
## 6 chr1:34985000-34990000 chr1:35930000-35935000

## Load HARs
har.regions <- readPeakFile("../Data/HAR_regions_hg38.bed")
names(har.regions) <- granges2peak(har.regions)
head(har.regions)

## GRanges object with 6 ranges and 2 metadata columns:
##                          seqnames            ranges strand |          V4
##                             <Rle>         <IRanges>  <Rle> |    <factor>
##     chr1:7999973-8000320     chr1   7999973-8000320      * | Bird_HAR165
##   chr1:19361591-19361913     chr1 19361591-19361913      * |  Toh_HAR234
##   chr1:20386867-20387046     chr1 20386867-20387046      * | Bird_HAR452
##   chr1:25561377-25561394     chr1 25561377-25561394      * |  Toh_HAR355
##   chr1:26983126-26983454     chr1 26983126-26983454      * | Bird_HAR414
##   chr1:37979601-37979949     chr1 37979601-37979949      * | Bird_HAR159
##                                                   V5
##                                             <factor>
##     chr1:7999973-8000320                      ERRFI1
##   chr1:19361591-19361913                        NBL1
##   chr1:20386867-20387046  DDOST,MUL1,PINK1,LOC339505
##   chr1:25561377-25561394 MAN1C1,FAM54B,LDLRAP1,SEPN1
##   chr1:26983126-26983454         TRNP1,FAM46B,SLC9A1
##   chr1:37979601-37979949                      INPP5B
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

Now that we have all the datasets loaded and formatted, we can start making the networks. First we'll link TFs to HARs using the presence of TF motifs. The TFRegionLinks function also computes a correlation between the overall TF motif activity (as quantified using chromVAR) and the accessibility of HARs using the single cell accessibility dataset. The correlations are computed using binned accessibility counts to improve correlations, similar to how Cicero computes co-accessibility, with the bins computed using nearest neighbors in the UMAP space (although feel free to use any type of dimensional reduction here).

## Link ATAC TFs to regions using motifs and correlations
register(MulticoreParam(8))
tf.peak.df <- TFRegionLinks(atac.sampled.counts, atac.umap.emb[colnames(atac.sampled.counts),],
                            har.regions, n.cores = 8)
head(tf.peak.df)

##      TF                   region        cor
## 1 NFIL3   chr1:60983464-60985043 0.14602608
## 2 NFIL3   chr1:87252413-87253598 0.06789432
## 3 NFIL3 chr1:176693610-176694338 0.05402977
## 4 NFIL3 chr1:209457389-209457747 0.01851922
## 5 NFIL3 chr1:210261952-210263304 0.13349384
## 6 NFIL3   chr2:36388849-36390064 0.10310539

We can then filter links between TFs and HARs using a correlation cutoff. The idea here is that the presence of a TF motif in a genomic region doesn't necessarily mean the TF is bound to that region in a given cell type. By ensuring the overall activity of the TF correlates with the accessibility of the region, we at least know that the TF and the region are active in the same cell types.

## Filter TF - region links using a correlation cutoff
hist(tf.peak.df$cor)

quantile(tf.peak.df$cor, na.rm = T)

##          0%         25%         50%         75%        100% 
## -0.34110081  0.01346722  0.05271586  0.09826845  0.67973153

nrow(tf.peak.df)

## [1] 46572

min.cor <- 0.1
tf.peak.df <- subset(tf.peak.df, cor > min.cor)
nrow(tf.peak.df)

## [1] 11321

Next we link HARs to downstream genes using the Hi-C dataset. We link an HAR to a gene if the HAR is in contact with a peak within 3 kb of the gene's transcription start site.

## Link ATAC HAR/HGE peaks to genes
har.peaks.ix <- findOverlaps(har.regions, atac.peaks)
har.peaks <- atac.peaks[unique(har.peaks.ix@to)]
peak.gene.df <- RegionGeneLinks(har.peaks, RG.conns, link.promoter = T,
                                promoter.region = c(-3e3, 3e3),
                                region.name = NULL, weight.col = NULL)

## >> preparing features information...      2021-01-29 11:00:48 PM 
## >> identifying nearest features...        2021-01-29 11:00:49 PM 
## >> calculating distance from peak to TSS...   2021-01-29 11:00:50 PM 
## >> assigning genomic annotation...        2021-01-29 11:00:50 PM 
## >> adding gene annotation...          2021-01-29 11:01:14 PM 
## >> assigning chromosome lengths           2021-01-29 11:01:14 PM 
## >> done...                    2021-01-29 11:01:14 PM 
## >> preparing features information...      2021-01-29 11:01:16 PM 
## >> identifying nearest features...        2021-01-29 11:01:16 PM 
## >> calculating distance from peak to TSS...   2021-01-29 11:01:17 PM 
## >> assigning genomic annotation...        2021-01-29 11:01:17 PM 
## >> adding gene annotation...          2021-01-29 11:01:22 PM 
## >> assigning chromosome lengths           2021-01-29 11:01:22 PM 
## >> done...                    2021-01-29 11:01:22 PM

head(peak.gene.df)

##                     region      gene weight
## 1   chr1:20385443-20387388   CAMK2N1      1
## 2   chr1:44019177-44020052    SLC6A9      1
## 3   chr1:44631310-44633056    RNF220      1
## 4   chr1:53325625-53328681 LINC01771      1
## 5 chr1:204931301-204931506     NFASC      1
## 6  chr10:24493883-24494528  KIAA1217      1

We can take advantage of our single cell datasets to ensure that the TFs are expressed in the relevant cell types. Since our Hi-C network was generated from Radial Glia (RG), we averaged across cell types to identify TFs that are expressed in Radial Glia.

## Find TFs expressed in Radial Glia
atac.rna.avg.counts <- GetClusterAvg(atac.rna.counts, atac.rna.ident, binarize = T)
hist(atac.rna.avg.counts)

quantile(atac.rna.avg.counts)

##        0%       25%       50%       75%      100% 
## 0.0000000 0.0002000 0.0022000 0.0172000 0.9694999

min.rna.cl.frac <- 0.025
rg.expr <- atac.rna.avg.counts[,"RG"]
rg.genes <- names(rg.expr[rg.expr > min.rna.cl.frac])

We also want to ensure that the HARs that link TFs to genes in our network are accessible in Radial Glia.

## Find peaks accessible in Radial Glia
atac.avg.counts <- GetClusterAvg(LogTFIDF(atac.sampled.counts), atac.sampled.ident)
hist(atac.avg.counts)

quantile(atac.avg.counts)

##          0%         25%         50%         75%        100% 
## 0.000000000 0.004599217 0.008348377 0.015376955 0.692295226

min.cl.access <- 0.01
rg.access <- atac.avg.counts[,"RG"]
rg.peaks <- names(rg.access[rg.access > min.cl.access])

We can then generate a TF to gene network where the TFs are expressed in Radial Glia and the HARs that link TFs to genes are accessible in Radial Glia. The SubsetLinks function returns a list of networks as dataframes, one for linking TFs to HARs, one for linking HARs to genes, and one for linking TFs to genes.

## Subset to HARs accessible in RG and TFs expressed in RG
rg.networks <- SubsetLinks(tf.peak.df, peak.gene.df, regions = rg.peaks, tfs = rg.genes)
sapply(rg.networks, nrow)

##   TF_region_network Region_gene_network     TF_gene_network 
##                2086                 168                 668

head(rg.networks$TF_gene_network)

##     TF     gene
## 1 IRF2 C11orf58
## 2 IRF2     SOX6
## 3 IRF2    MEIS2
## 4 IRF2  TBC1D16
## 5 IRF2   IGFBP2
## 6 IRF2      BBX

Finally we can plot our HAR-centered TF regulatory network that is also specific to Radial Glia cells. The red nodes indicate TFs and the smaller blue nodes are genes.

## Plot RG network
PlotNetwork(rg.networks$TF_gene_network)

As you can see the network is quite large and interconnected, so let's try and identify some interesting biology to zoom in on. First, we want to find the TFs and genes that have the most influence on the network. For this, we can use the eigenvector centrality metric computed using the igraph package. Since we also want to look for genes that are highly interconnected as well, we choose to compute the undirected eigenvector centrality.

rg.graph <- graph_from_data_frame(rg.networks$TF_gene_network)
node.centrality <- eigen_centrality(rg.graph, directed = F)
node.centrality <- sort(node.centrality$vector, decreasing = T)
ggBarplot(head(node.centrality, n = 10))

We can see that ZFHX4, zinc finger protein linked to intellectual disability and speech development is highly central in our Radial Glia network. Thus we zoom into a subnetwork of all genes within 2 degrees of separation of ZFHX4 to get a better sense of the specific genes/TFs involved.

tf.gene.df <- rg.networks$TF_gene_network

ZFHX4.neighbors <- subset(tf.gene.df, TF == "ZFHX4" | gene == "ZFHX4")
ZFHX4.network.df <- subset(tf.gene.df, gene %in% ZFHX4.neighbors$gene | TF %in% ZFHX4.neighbors$TF)
PlotNetwork(ZFHX4.network.df, plot.title = "ZFHX4 Subnetwork", label = T)

Building Region-Centered TF Networks

Yan Wu

1/29/2021