Analyzing independently assayed scATAC-seq and scRNA-seq data using FigR

Running FigR on independently assayed scATAC-seq and scRNA-seq data

In many instances, we don’t have true multi-modal data (i.e. chromatin accessibility and gene expression profiled from the same single cell). In these cases, computationally determining “paired” cells across assays from the same cellular contexts is useful, and is a necessary first step prior to performing downstream analysis, including the cis-regulatory correlation analyses (i.e. peak-to-gene associations) and gene regulatory network inference steps that are part of the FigR framework.

Here we demo the application of the FigR workflow to independently generated scATAC-seq and scRNA-seq data using a subset of the data generated and analyzed in the associated manuscript ¹

Data downloads / Pre-processing

Here, we will use a subset of the PBMC stimulus dataset pertaining to DMSO (Control_1h) condition cells for which we have independently profiled scATAC-seq and scRNA-seq data. The data used in this tutorial can be downloaded (WARNING: Downloading this data will take up ~ 71 MB of your hard drive space) as shown below. We then set the unzipped directory (destination folder) to be the working directory, prior to loading the R objects.

Please see this vignette on applying FigR to multi-modal data for more details on the input object data structures, function parameters, and run time

# Directory where data will be downloaded to
workingDir <- "/Users/vinay/Documents/FigR_package/biorad"
stimZip <- "https://s3.us-east-1.amazonaws.com/vkartha/FigR/FigR_stim.zip"
download.file(url = stimZip,
              destfile = paste(workingDir,basename(stimZip),sep = "/"))
unzip(zipfile = paste(workingDir,basename(stimZip),sep = "/"),exdir = workingDir,overwrite = FALSE)

setwd(workingDir)

As before, let us load some dependencies, including the main FigR package

library(doParallel)
library(BuenColors)
library(FigR)

#> Warning: package 'cowplot' was built under R version 4.0.4

library(BSgenome.Hsapiens.UCSC.hg19)

Here, we require/load the following input objects:

A SummarizedExperiment object of peaks x cells, housing the scATAC-seq reads in accessibility peaks raw counts, along with the GRanges of the peaks used as RowRanges info
A normalized sparseMatrix object of genes x cells, housing the scRNA-seq gene counts (e.g. exported from Seurat)
Previously computed canonical correlation analyis (CCA) components (determined using the CCA pipeline in Seurat) for both ATAC and RNA PBMCs
Latent Semantic Index (LSI) components for the scATAC-seq cells, used to derive cell k-nearest neighbors to smooth scores with (NOTE: In the multi-modal vignette, we show how you can use alternative cell embeddings to derive this kNN, such as cisTopic scores. Essentially, you would use whichever reduced dimensions were used to project cells in 2D for visualization)

In this example, we show these inputs pertaining to independently generated scATAC and scRNA-seq data in human PBMCs. For simplicity, we only run this on a subset of cells (quality filtered, Control 1h DMSO condition cells for faster run-time). As a result, note that the results will be similar to what was presented in Kartha et al. , but not identical.

ATAC.se <- readRDS("./control1h_PBMC_atac_SE.rds")
RNAmat <- readRDS("./control1h_PBMC_RNAnorm.rds")


dim(ATAC.se) # Peaks x ATAC cells

#> [1] 219136   5352

dim(RNAmat) # Genes x RNA cells

#> [1] 15584  3508

Since we’re going to integrate the two datasets here (unpaired as of now), we will use a shared co-embedding that can be generated for both ATAC and RNA cells. See this tutorial for examples of how this can be done. In our case, we used the union of the top variable ATAC genescores (based on a fixed window around each gene’s TSS), and top variable RNA genes across all cells, and performed a CCA using the resulting features and all ATAC/RNA cells as input to the RunCCA function. There are now alternative approaches that are faster / scale better for co-embedding larger datasets, including the RPCA framework implemented in the Seurat package.

# Load CCA components (used for pairing)
# This was derived by running CCA (using all cells) between the original ATAC/RNA data using the top variable scATAC-seq gene scores and scRNA-seq gene expression
CCA_PCs <- readRDS('./control1h_PBMC_atac_rna_CCA_l2.rds')
dim(CCA_PCs) # ATAC + RNA (rows), n components (columns)

#> [1] 8860   50

# This is just to differentiate ATAC/RNA from the combined matrix of CCA components so they can be separated for pairing
# Based on barcode naming conventions to differentiate ATAC/RNA in our

head(rownames(CCA_PCs)) # ATAC cells

#> [1] "Control_1h_Donor1_S1_BC0004_N01" "Control_1h_Donor1_S1_BC0006_N01"
#> [3] "Control_1h_Donor1_S1_BC0008_N01" "Control_1h_Donor1_S1_BC0009_N01"
#> [5] "Control_1h_Donor1_S1_BC0010_N01" "Control_1h_Donor1_S1_BC0012_N01"

tail(rownames(CCA_PCs)) # RNA cells

#> [1] "Control_1h_Donor4_tgtagtggagttgcacgttgg"
#> [2] "Control_1h_Donor4_aatggccgcacagccgcgctt"
#> [3] "Control_1h_Donor4_cggccaggtcggttttggtta"
#> [4] "Control_1h_Donor4_ggcaggctccttaactggcat"
#> [5] "Control_1h_Donor4_gcagtgtactacgactggcat"
#> [6] "Control_1h_Donor4_tcagcaatcgcgcattcctct"

isATAC <- grepl("BC",rownames(CCA_PCs))
table(isATAC) # ATAC vs RNA

#> isATAC
#> FALSE  TRUE 
#>  3508  5352

ATACcells <- rownames(CCA_PCs)[isATAC]
RNAcells <- rownames(CCA_PCs)[!isATAC]

cat("ATAC cells in condition: ",length(ATACcells),"\n")

#> ATAC cells in condition:  5352

cat("RNA cells in condition: ",length(RNAcells),"\n")

#> RNA cells in condition:  3508

We can now visualize UMAP of the ATAC-RNA cell co-embedding, coloring cells by the assay they come from (similar / shared cell states across the two assays should co-localize to some degree in 2D)

nPCs <- 20 # Num CCA PCs to use when running UMAP / pairing

# Run UMAP
set.seed(123)
umap.out <- uwot::umap(CCA_PCs[,1:nPCs],
                       metric="cosine",
                       n_neighbors=30)

umap.d <- as.data.frame(umap.out)
colnames(umap.d) <- c("UMAP1","UMAP2")
rownames(umap.d) <- rownames(CCA_PCs)
  
umap.d$Assay <- ifelse(isATAC,"ATAC","RNA")

BuenColors::shuf(umap.d) %>% 
  ggplot(aes(UMAP1,UMAP2,color=Assay)) + 
  geom_point(size=0.1) + 
  theme_classic() + 
  scale_color_manual(values = c("cadetblue","darkorange"))+
  guides(colour = guide_legend(override.aes = list(size=3)))

Now, we use these co-embeddings to pair cells using our geodesic distance-based pairing approach. We do this step using the pairCells function, providing the separate ATAC and RNA cell components as input as highlighted below:

ATAC_PCs <- CCA_PCs[isATAC,]
RNA_PCs <- CCA_PCs[!isATAC,]

pairing <- pairCells(ATAC = ATAC_PCs,
                     RNA = RNA_PCs,
                     keepUnique = TRUE)

By setting the keepUnique parameter specified above to TRUE, we further filter cells such that each ATAC cell barcode only appears once in the resulting pairs (in the case above).

We can now visualize the resulting ATAC-RNA pairs on the CCA UMAP

plotPairs(ATAC = pairing$ATAC,
          RNA=pairing$RNA,
          max.show = 100,
          umap.df = umap.d)

#> Assuming concordance in pairs ..
#> Only highlighting 100 pairs at random..

#> Loading required package: ggrastr

#> Warning: package 'ggrastr' was built under R version 4.0.4

As the final step, we subset the ATAC and RNA count objects to match the determined ATAC-RNA paired cells, and use these as input in later steps

ATAC.se.paired <- ATAC.se[,pairing$ATAC]
RNAmat.paired <- RNAmat[,pairing$RNA]

Peak-gene association testing

# Don't run interactively
cisCorr <- runGenePeakcorr(ATAC.se = ATAC.se.paired,
                           RNAmat = RNAmat.paired,
                           genome = "hg19", # One of hg19, mm10 or hg38 
                           nCores = 4,
                           p.cut = NULL, # Set this to NULL and we can filter later
                           n_bg = 100)

Determining DORCs

We can now filter correlations based on the background p-value, and summarize the genes that have a relatively high number of peak-gene associations (i.e. Domains of regulatory chromatin or ‘DORCs’)

head(cisCorr)

#>   Peak         PeakRanges      Gene        rObs       pvalZ
#> 1    2 chr1:713966-714266 LINC01128 0.018159834 0.100025782
#> 2    5 chr1:756685-756985 LINC01128 0.004361716 0.318375663
#> 3    8 chr1:777265-777565 LINC01128 0.006061255 0.165574826
#> 4   19 chr1:839948-840248    SAMD11 0.003911159 0.289553871
#> 5   23 chr1:848168-848468    SAMD11 0.045598153 0.005385101
#> 6   26 chr1:859076-859376    SAMD11 0.031427158 0.107115444

cisCorr.filt <- cisCorr %>% filter(pvalZ <= 0.05)

library(ggplot2)
dorcGenes <- dorcJPlot(dorcTab = cisCorr.filt,
                         cutoff = 10, # No. sig peaks needed to be called a DORC
                         labelTop = 20,
                         returnGeneList = TRUE, # Set this to FALSE for just the plot
                         force=5)

We can also just get the full ranked table of genes and number of significantly associated peaks, instead of only getting the gene names (so you can threshold yourself)

# Unfiltered
numDorcs <- cisCorr.filt %>% group_by(Gene) %>% tally() %>% arrange(desc(n))
numDorcs

#> # A tibble: 9,272 x 2
#>    Gene         n
#>    <chr>    <int>
#>  1 IL7R        23
#>  2 TCF7        23
#>  3 CD83        22
#>  4 SEMA7A      20
#>  5 APOBEC3G    19
#>  6 CCR7        18
#>  7 CD8A        18
#>  8 PAX5        18
#>  9 SLC7A5      18
#> 10 ANPEP       17
#> # ... with 9,262 more rows

Since these are unstimulated (control) PBMCs, we expect most of these genes to be lineage-determining markers.

Visualizing DORCs

To get the DORC accessibility scores, we can sum up the chromatin accessibility peak counts for peaks associated with a given gene.

NOTE: It is crucial that you use the exact same SummarizedExperiment object here as the one that was used in the runGenePeakcorr step

dorcMat <- getDORCScores(ATAC.se = ATAC.se.paired, # Has to be same SE as used in previous step
                         dorcTab = cisCorr.filt,
                         geneList = dorcGenes,
                         nCores = 4)

#> ........
#> Normalizing scATAC counts ..
#> Centering counts for cells sequentially in groups of size  5000  ..
#> 
#> Computing centered counts for cells:  1  to  4912 ..
#> Computing centered counts per cell using mean reads in features ..
#> 
#> Merging results..
#> Done!
#> Computing DORC scores ..
#> Running in parallel using  1 cores ..
#> 
#> Time Elapsed:  10.2501220703125 secs

dim(dorcMat)

#> [1]   99 4912

We then plot these scores, along with the paired RNA expression onto the original scATAC-seq UMAP. Here’s a look at what that 2D embedding looks like (also highlighted in the manuscript), just for this subset of cells:

colData(ATAC.se.paired) %>% 
  as.data.frame() %>% 
  ggplot(aes(UMAP1,UMAP2)) + 
  geom_point(color="gray",size=0.8)+ theme_classic()

We can then smooth these (sparse) DORC counts, and visualize side-by-side with the expression, on the same UMAP. In this particular case, we use the original Latent Semantic Indexing (LSI) cell embeddings that were pre-determined using all cells (using which the 2D UMAP was determined), subset to the cells being used, and then derive cell k-nearest neighbors, which we will use to smooth both the DORC and [aired RNA expression, for the same UMAP:

lsi <- readRDS("./control1h_PBMC_atac_lsi.rds")
dim(lsi)

stopifnot(all(colnames(dorcMat) %in% rownames(lsi)))

# Subset to paired ATAC
lsi <- lsi[colnames(dorcMat),]

dim(lsi)

# Get cell KNNs
cellkNN <- FNN::get.knn(lsi,k=30)$nn.index
rownames(cellkNN) <- colnames(dorcMat)

# Smooth dorc scores using cell KNNs (k=30)
library(doParallel)
dorcMat.s <- smoothScoresNN(NNmat = cellkNN,mat = dorcMat,nCores = 4)

# Smooth RNA using cell KNNs
# This takes longer since it's all genes
colnames(RNAmat.paired) <- colnames(ATAC.se.paired) # Just so that the smoothing function doesn't throw an error (matching cell barcodes in the KNN and the matrix)
RNAmat.s <- smoothScoresNN(NNmat = cellkNN,mat = RNAmat.paired,nCores = 4)

# Visualize on pre-computed UMAP
# This is the ATAC UMAP shown in the paper (based on ATAC LSI)
umap.d <- as.data.frame(colData(ATAC.se.paired)[,c("UMAP1","UMAP2")])

# DORC scores for top DORC(s)
myDORCs <- c("SEMA7A","IL7R","CD93")

dorcGGlist <- lapply(myDORCs,function(x) { 
  plotMarker2D(umap.d,
               dorcMat.s,
               markers = x,
               maxCutoff = "q0.99",
               colorPalette = "brewer_heat"
  ) + ggtitle(paste0(x," DORC"))
})

#> Plotting  SEMA7A 
#> Plotting  IL7R 
#> Plotting  CD93

# Paired RNA expression for top DORC(s)
# Plot on the same reference ATAC UMAP
rnaGGlist <- lapply(myDORCs,function(x) { 
  plotMarker2D(umap.d,
               RNAmat.s,
               markers = x,
               maxCutoff = "q0.99",
               colorPalette = "brewer_purple"
  ) + ggtitle(paste0(x," RNA"))
})

#> Plotting  SEMA7A 
#> Plotting  IL7R 
#> Plotting  CD93

library(patchwork)

#> Warning: package 'patchwork' was built under R version 4.0.4

(dorcGGlist[[1]] + dorcGGlist[[2]] + dorcGGlist[[3]]) /  (rnaGGlist[[1]] + rnaGGlist[[2]] + rnaGGlist[[3]])

TF-gene associations

As before, we can now determine TF-gene associations, and begin inferring a regulatory network based on the previous DORC definitions, together with information drawn from a database of TF binding sequence motifs

figR.d <- runFigRGRN(ATAC.se = ATAC.se.paired, # Must be the same input as used in runGenePeakcorr()
                     dorcTab = cisCorr.filt, # Filtered peak-gene associations
                     genome = "hg19",
                     dorcMat = dorcMat.s,
                     dorcK = 5, 
                     rnaMat = RNAmat.s, 
                     nCores = 4)

Visualizing FigR results

Global regulation profile

Scatter of correlation (x) and motif enrichment (y)

figR.d %>% 
  ggplot(aes(Corr.log10P,Enrichment.log10P,color=Score)) + 
  ggrastr::geom_point_rast(size=0.01,shape=16) + 
  theme_classic() + 
  scale_color_gradientn(colours = jdb_palette("solar_extra"),limits=c(-3,3),oob = scales::squish,breaks=scales::breaks_pretty(n=3))

Looking at the top TFs associated with DORCs

Ranking TF drivers

rankDrivers(figR.d,rankBy = "meanScore")

#> Warning: ggrepel: 63 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps

rankDrivers(figR.d,score.cut = 1.5,rankBy = "nTargets",interactive = TRUE)

You can also set the interactive parameter (which the rankDrivers function takes) to TRUE, which will let you hover over points and provide some extra information in terms of the number of estimated activated vs repressed genes

Heatmap view

library(ComplexHeatmap)
plotfigRHeatmap(figR.d = figR.d,
                score.cut = 1.5,
                column_names_gp = gpar(fontsize=6), # from ComplexHeatmap
                show_row_dend = FALSE # from ComplexHeatmap
                )

Network view

library(networkD3)

#> Warning: package 'networkD3' was built under R version 4.0.5

plotfigRNetwork(figR.d,
                      score.cut = 1.5,
                      weight.edges = TRUE)

Vinay Kartha

2022-03-26