I was going to write a post on using the Seurat alignment method as a batch correction tool but as it turned out the two datasets that I chose didn’t seem to have strong batch effects! I heard about the alignment method sometime last year but was motivated to try it out after listening to Andrew Butler talk about it at the Single Cell Genomics Day. Therefore this post is simply on merging two 10x single cell datasets, namely the PBMC4K and PBMC8K datasets. There is already a merge tutorial but here I show the PCA and t-SNE plots. I also have a Getting started with Seurat post that you can check out if you are unfamiliar with the software.
To get started, download the two filtered cell matrices [click on Gene / cell matrix (filtered)] from the 10x datasets page and put them into two separate folders (e.g. pbmc4k and pbmc8k); this is necessary because the folder names are the same when you extract the tarball. Below are the links to the datasets; you might have to enter your email into a webform when you click the links.
Change into the directories where the files were downloaded and extract them.
cd pbmc4k tar -xzf pbmc4k_filtered_gene_bc_matrices.tar.gz cd ../pbmc8k tar -xzf pbmc8k_filtered_gene_bc_matrices.tar.gz
Load up R and install Seurat if you haven’t already.
# install if necessary install.packages('Seurat') # load library library(Seurat) # version packageVersion('Seurat')  ‘2.2.0’
There is a Read10X() function that can be used to read in the 10x data. You’ll need to specify the path to the matrix, genes, and barcode files for each dataset, i.e. the path to the GRCh38 folder.
pbmc4k.data <- Read10X(data.dir = "~/google_drive/muse/seurat/pbmc4k/filtered_gene_bc_matrices/GRCh38/") pbmc4k <- CreateSeuratObject(raw.data = pbmc4k.data, project = "PBMC4K") pbmc8k.data <- Read10X(data.dir = "~/google_drive/muse/seurat/pbmc8k/filtered_gene_bc_matrices/GRCh38/") pbmc8k <- CreateSeuratObject(raw.data = pbmc8k.data, project = "PBMC8K") pbmc4k An object of class seurat in project PBMC4K 33694 genes across 4340 samples. pbmc8k An object of class seurat in project PBMC8K 33694 genes across 8381 samples. pbmc<- MergeSeurat(object1 = pbmc4k, object2 = pbmc8k, add.cell.id1 = "4K", add.cell.id2 = "8K", project = "PBMC12K") pbmc An object of class seurat in project PBMC12K 33694 genes across 12721 samples.
Next we’ll filter out cells with high counts of mitochondrial transcripts.
mito.genes <- grep(pattern = "^MT-", x = rownames(x = firstname.lastname@example.org), value = TRUE) length(mito.genes)  13 percent.mito <- Matrix::colSums(email@example.com[mito.genes, ]) / Matrix::colSums(firstname.lastname@example.org) pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito") VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
We’ll filter out cells with 9% or higher mitochondrial content (a completely arbitrary threshold that I chose).
ggplot(email@example.com, aes(nUMI, percent.mito)) + geom_point(size = 0.5) + geom_hline(yintercept = 0.09, linetype = "dashed", colour = "red")
Perform the filtering using the FilterCells() function.
pbmc An object of class seurat in project PBMC12K 33694 genes across 12721 samples. pbmc <- FilterCells(object = pbmc, subset.names = c("nGene", "percent.mito"), low.thresholds = c(200, -Inf), high.thresholds = c(2500, 0.09)) pbmc An object of class seurat in project PBMC12K 33694 genes across 12220 samples.
Next we’ll find variable genes for use in the PCA instead of unnecessarily using all the genes. According to the help page:
Setting the y.cutoff parameter to 2 identifies genes that are more than two standard deviations away from the average dispersion within a bin.
I’ve set this to 0.8 to return around 4,000 genes.
pbmc <- FindVariableGenes(object = pbmc, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.8) length(firstname.lastname@example.org)  4269
We’ll adjust gene expression by regressing against the UMI count and percent mitochondrial.
pbmc <- ScaleData(object = pbmc, vars.to.regress = c("nUMI", "percent.mito"))
Now to run the PCA.
pbmc <- RunPCA(object = pbmc, pc.genes = email@example.com, do.print = TRUE, pcs.print = 1:5, genes.print = 5) PCAPlot(object = pbmc, dim.1 = 1, dim.2 = 2)
And finally we run the t-SNE step.
pbmc <- RunTSNE(object = pbmc, dims.use = 1:10, do.fast = TRUE) TSNEPlot(object = pbmc, do.label = TRUE)
Below I have just separated the two datasets to show that the 8K dataset does indeed have more cells.
my_4k_cell <- grep("^4K", row.names(firstname.lastname@example.org), value = TRUE) my_8k_cell <- grep("^8K", row.names(email@example.com), value = TRUE) my_4k_tsne <- TSNEPlot(object = pbmc, do.label = TRUE, cells.use = my_4k_cell, do.return = TRUE) my_8k_tsne <- TSNEPlot(object = pbmc, do.label = TRUE, cells.use = my_8k_cell, do.return = TRUE) plot_grid( my_4k_tsne, my_8k_tsne )
As you can see from the PCA and t-SNE plots, the two datasets overlap each other very well! I realise the two datasets are technical replicates but I was expecting a bit more technical variation.
This work is licensed under a Creative Commons
Attribution 4.0 International License.