Merging two 10x single cell datasets

I was going to write a post on using the Seurat alignment method as a batch correction tool but as it turned out the two datasets that I chose didn't seem to have strong batch effects! I heard about the alignment method sometime last year but was motivated to try it out after listening to Andrew Butler talk about it at the Single Cell Genomics Day. Therefore this post is simply on merging two 10x single cell datasets, namely the PBMC4K and PBMC8K datasets. There is already a merge tutorial but here I show the PCA and t-SNE plots. I also have a Getting started with Seurat post that you can check out if you are unfamiliar with the software.

To get started, download the two filtered cell matrices [click on Gene / cell matrix (filtered)] from the 10x datasets page and put them into two separate folders (e.g. pbmc4k and pbmc8k); this is necessary because the folder names are the same when you extract the tarball. Below are the links to the datasets; you might have to enter your email into a webform when you click the links.

Change into the directories where the files were downloaded and extract them.

cd pbmc4k
tar -xzf pbmc4k_filtered_gene_bc_matrices.tar.gz
cd ../pbmc8k
tar -xzf pbmc8k_filtered_gene_bc_matrices.tar.gz

Load up R and install Seurat if you haven't already.

# install if necessary
install.packages('Seurat')

# load library
library(Seurat)

# version
packageVersion('Seurat')
[1] ‘2.2.0’

There is a Read10X() function that can be used to read in the 10x data. You'll need to specify the path to the matrix, genes, and barcode files for each dataset, i.e. the path to the GRCh38 folder.

pbmc4k.data <- Read10X(data.dir = "~/google_drive/muse/seurat/pbmc4k/filtered_gene_bc_matrices/GRCh38/")
pbmc4k <- CreateSeuratObject(raw.data = pbmc4k.data, project = "PBMC4K")

pbmc8k.data <- Read10X(data.dir = "~/google_drive/muse/seurat/pbmc8k/filtered_gene_bc_matrices/GRCh38/")
pbmc8k <- CreateSeuratObject(raw.data = pbmc8k.data, project = "PBMC8K")

pbmc4k
An object of class seurat in project PBMC4K 
 33694 genes across 4340 samples.

pbmc8k
An object of class seurat in project PBMC8K 
 33694 genes across 8381 samples.

pbmc<- MergeSeurat(object1 = pbmc4k,
                   object2 = pbmc8k,
                   add.cell.id1 = "4K", 
                   add.cell.id2 = "8K",
                   project = "PBMC12K")

pbmc
An object of class seurat in project PBMC12K 
 33694 genes across 12721 samples.

Next we'll filter out cells with high counts of mitochondrial transcripts.

mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@raw.data), value = TRUE)
length(mito.genes)
[1] 13

percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ]) / Matrix::colSums(pbmc@raw.data)

pbmc <- AddMetaData(object = pbmc,
                    metadata = percent.mito,
                    col.name = "percent.mito")

VlnPlot(object = pbmc,
        features.plot = c("nGene", "nUMI", "percent.mito"),
        nCol = 3)

We'll filter out cells with 9% or higher mitochondrial content (a completely arbitrary threshold that I chose).

ggplot(pbmc@meta.data, aes(nUMI, percent.mito)) +
  geom_point(size = 0.5) +
  geom_hline(yintercept = 0.09, linetype = "dashed", colour = "red")

Perform the filtering using the FilterCells() function.

pbmc
An object of class seurat in project PBMC12K 
 33694 genes across 12721 samples.

pbmc <- FilterCells(object = pbmc,
                    subset.names = c("nGene", "percent.mito"),
                    low.thresholds = c(200, -Inf),
                    high.thresholds = c(2500, 0.09))
 
pbmc
An object of class seurat in project PBMC12K 
 33694 genes across 12220 samples.

Next we'll find variable genes for use in the PCA instead of unnecessarily using all the genes. According to the help page:

Setting the y.cutoff parameter to 2 identifies genes that are more than two standard deviations away from the average dispersion within a bin.

I've set this to 0.8 to return around 4,000 genes.

pbmc <- FindVariableGenes(object = pbmc,
                          mean.function = ExpMean,
                          dispersion.function = LogVMR,
                          x.low.cutoff = 0.0125,
                          x.high.cutoff = 3,
                          y.cutoff = 0.8)

length(pbmc@var.genes)
[1] 4269

We'll adjust gene expression by regressing against the UMI count and percent mitochondrial.

pbmc <- ScaleData(object = pbmc,
                  vars.to.regress = c("nUMI", "percent.mito"))

Now to run the PCA.

pbmc <- RunPCA(object = pbmc,
               pc.genes = pbmc@var.genes,
               do.print = TRUE,
               pcs.print = 1:5,
               genes.print = 5)

PCAPlot(object = pbmc, dim.1 = 1, dim.2 = 2)

And finally we run the t-SNE step.

pbmc <- RunTSNE(object = pbmc,
                dims.use = 1:10,
                do.fast = TRUE)
 
TSNEPlot(object = pbmc, do.label = TRUE)

Below I have just separated the two datasets to show that the 8K dataset does indeed have more cells.


my_4k_cell <- grep("^4K", row.names(pbmc@meta.data), value = TRUE)
my_8k_cell <- grep("^8K", row.names(pbmc@meta.data), value = TRUE)
my_4k_tsne <- TSNEPlot(object = pbmc, do.label = TRUE, cells.use = my_4k_cell, do.return = TRUE)
my_8k_tsne <- TSNEPlot(object = pbmc, do.label = TRUE, cells.use = my_8k_cell, do.return = TRUE)
plot_grid(
  my_4k_tsne,
  my_8k_tsne
)

As you can see from the PCA and t-SNE plots, the two datasets overlap each other very well! I realise the two datasets are technical replicates but I was expecting a bit more technical variation.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
12 comments Add yours
  1. Hi Davo,

    Thank you for your tutorials for scRNA seq analysis. Your tutorials give me much help!

    I always hear that batch effect is one of the major problems we may face in processing scRNA seq data. And you also mentioned in your article. My question is, how did you know if there is strong batch effect in two (or maybe more) datasets? Also through Seurat?

  2. Hi Dave,

    Is there a possibility to merge multiple (eg 12-15) 10x libraries together? When I try MergeSeurat it doesn’t merge more than 2 at a time right? Any tips would be much appreciated.

    Best,
    Den

  3. Hi Dave,

    Thanks for making such great blogs! I was wondering whether it is necessarily to run NormalizeData() after merging multiple Seurat objects (eight in my case) or not?

    Many thanks,
    Lucy

    1. Hi Lucy,

      the post is outdated and Seurat 3 now uses the merge() function instead of MergeSeurat(). From the documentation:

      When merging Seurat objects, the merge procedure will merge the Assay level counts and potentially the data slots (depending on the merge.data parameter).

      From what I gathered, only the raw counts are merged by default, so you need to perform the normalisation again.

      Cheers,

      Dave

  4. Hi Dave,

    Thanks for making such great blogs.
    IS the merging of two cell data similar like making it replicates??

  5. Hi Dave,
    How to merge count matrices with different number of genes?
    Thank you.
    – Gargi

  6. Thanks Dave!

    This is very useful, I have been looking for a way to merge two Seurat objects and do the dimension reduction in one go. Thank you so much!

    Carrie

  7. Hi Dave,

    This is great. Thank you for all your work. Just a small request do you mind doing an example of integrating multiple ScATAC objects or 10x runs to together because when I did merge I do see batch effects and I hard integration works really well

    1. Hi Sudheshna,

      thanks for the comment. I’ve never worked with scATAC data but would be interested to learn about it. We can work on it together if you’re interested.

      Cheers,
      Dave

      1. Hi Dave,

        Absolutely, I just started working on a project and would love to work on that with you

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.