I was going to write a post on using the Seurat alignment method as a batch correction tool but as it turned out the two datasets that I chose didn't seem to have strong batch effects! I heard about the alignment method sometime last year but was motivated to try it out after listening to Andrew Butler talk about it at the Single Cell Genomics Day. Therefore this post is simply on merging two 10x single cell datasets, namely the PBMC4K and PBMC8K datasets. There is already a merge tutorial but here I show the PCA and t-SNE plots. I also have a Getting started with Seurat post that you can check out if you are unfamiliar with the software.
To get started, download the two filtered cell matrices [click on Gene / cell matrix (filtered)] from the 10x datasets page and put them into two separate folders (e.g. pbmc4k and pbmc8k); this is necessary because the folder names are the same when you extract the tarball. Below are the links to the datasets; you might have to enter your email into a webform when you click the links.
Change into the directories where the files were downloaded and extract them.
cd pbmc4k tar -xzf pbmc4k_filtered_gene_bc_matrices.tar.gz cd ../pbmc8k tar -xzf pbmc8k_filtered_gene_bc_matrices.tar.gz
Load up R and install Seurat if you haven't already.
# install if necessary install.packages('Seurat') # load library library(Seurat) # version packageVersion('Seurat') [1] ‘2.2.0’
There is a Read10X() function that can be used to read in the 10x data. You'll need to specify the path to the matrix, genes, and barcode files for each dataset, i.e. the path to the GRCh38 folder.
pbmc4k.data <- Read10X(data.dir = "~/google_drive/muse/seurat/pbmc4k/filtered_gene_bc_matrices/GRCh38/") pbmc4k <- CreateSeuratObject(raw.data = pbmc4k.data, project = "PBMC4K") pbmc8k.data <- Read10X(data.dir = "~/google_drive/muse/seurat/pbmc8k/filtered_gene_bc_matrices/GRCh38/") pbmc8k <- CreateSeuratObject(raw.data = pbmc8k.data, project = "PBMC8K") pbmc4k An object of class seurat in project PBMC4K 33694 genes across 4340 samples. pbmc8k An object of class seurat in project PBMC8K 33694 genes across 8381 samples. pbmc<- MergeSeurat(object1 = pbmc4k, object2 = pbmc8k, add.cell.id1 = "4K", add.cell.id2 = "8K", project = "PBMC12K") pbmc An object of class seurat in project PBMC12K 33694 genes across 12721 samples.
Next we'll filter out cells with high counts of mitochondrial transcripts.
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@raw.data), value = TRUE) length(mito.genes) [1] 13 percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ]) / Matrix::colSums(pbmc@raw.data) pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito") VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
We'll filter out cells with 9% or higher mitochondrial content (a completely arbitrary threshold that I chose).
ggplot(pbmc@meta.data, aes(nUMI, percent.mito)) + geom_point(size = 0.5) + geom_hline(yintercept = 0.09, linetype = "dashed", colour = "red")
Perform the filtering using the FilterCells() function.
pbmc An object of class seurat in project PBMC12K 33694 genes across 12721 samples. pbmc <- FilterCells(object = pbmc, subset.names = c("nGene", "percent.mito"), low.thresholds = c(200, -Inf), high.thresholds = c(2500, 0.09)) pbmc An object of class seurat in project PBMC12K 33694 genes across 12220 samples.
Next we'll find variable genes for use in the PCA instead of unnecessarily using all the genes. According to the help page:
Setting the y.cutoff parameter to 2 identifies genes that are more than two standard deviations away from the average dispersion within a bin.
I've set this to 0.8 to return around 4,000 genes.
pbmc <- FindVariableGenes(object = pbmc, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.8) length(pbmc@var.genes) [1] 4269
We'll adjust gene expression by regressing against the UMI count and percent mitochondrial.
pbmc <- ScaleData(object = pbmc, vars.to.regress = c("nUMI", "percent.mito"))
Now to run the PCA.
pbmc <- RunPCA(object = pbmc, pc.genes = pbmc@var.genes, do.print = TRUE, pcs.print = 1:5, genes.print = 5) PCAPlot(object = pbmc, dim.1 = 1, dim.2 = 2)
And finally we run the t-SNE step.
pbmc <- RunTSNE(object = pbmc, dims.use = 1:10, do.fast = TRUE) TSNEPlot(object = pbmc, do.label = TRUE)
Below I have just separated the two datasets to show that the 8K dataset does indeed have more cells.
my_4k_cell <- grep("^4K", row.names(pbmc@meta.data), value = TRUE) my_8k_cell <- grep("^8K", row.names(pbmc@meta.data), value = TRUE) my_4k_tsne <- TSNEPlot(object = pbmc, do.label = TRUE, cells.use = my_4k_cell, do.return = TRUE) my_8k_tsne <- TSNEPlot(object = pbmc, do.label = TRUE, cells.use = my_8k_cell, do.return = TRUE) plot_grid( my_4k_tsne, my_8k_tsne )
As you can see from the PCA and t-SNE plots, the two datasets overlap each other very well! I realise the two datasets are technical replicates but I was expecting a bit more technical variation.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi Davo,
Thank you for your tutorials for scRNA seq analysis. Your tutorials give me much help!
I always hear that batch effect is one of the major problems we may face in processing scRNA seq data. And you also mentioned in your article. My question is, how did you know if there is strong batch effect in two (or maybe more) datasets? Also through Seurat?
Hi Li,
typically batch effects are noticed through PCA or clustering; for example, if you see that your cells are clustering by replicate rather than by cell type. This review article is a great read https://www.ncbi.nlm.nih.gov/pubmed/20838408 and can be extended to scRNA-seq.
Cheers,
Dave
Hi Dave,
Is there a possibility to merge multiple (eg 12-15) 10x libraries together? When I try MergeSeurat it doesn’t merge more than 2 at a time right? Any tips would be much appreciated.
Best,
Den
Hi Dave,
Thanks for making such great blogs! I was wondering whether it is necessarily to run NormalizeData() after merging multiple Seurat objects (eight in my case) or not?
Many thanks,
Lucy
Hi Lucy,
the post is outdated and Seurat 3 now uses the merge() function instead of MergeSeurat(). From the documentation:
When merging Seurat objects, the merge procedure will merge the Assay level counts and potentially the data slots (depending on the merge.data parameter).
From what I gathered, only the raw counts are merged by default, so you need to perform the normalisation again.
Cheers,
Dave
Hi Dave,
Thanks for making such great blogs.
IS the merging of two cell data similar like making it replicates??
Hi Dave,
How to merge count matrices with different number of genes?
Thank you.
– Gargi
Thanks Dave!
This is very useful, I have been looking for a way to merge two Seurat objects and do the dimension reduction in one go. Thank you so much!
Carrie
Hi Dave,
This is great. Thank you for all your work. Just a small request do you mind doing an example of integrating multiple ScATAC objects or 10x runs to together because when I did merge I do see batch effects and I hard integration works really well
Hi Sudheshna,
thanks for the comment. I’ve never worked with scATAC data but would be interested to learn about it. We can work on it together if you’re interested.
Cheers,
Dave
Hi Dave,
Absolutely, I just started working on a project and would love to work on that with you
OK let’s do it! Please email me@davetang.org and we can discuss how to get started!