Getting started with Seurat

This post follows the Peripheral Blood Mononuclear Cells (PBMCs) tutorial for 2,700 single cells. It was written while I was going through the tutorial and contains my notes. The dataset for this tutorial can be downloaded from the 10X Genomics dataset page but it is also hosted on Amazon (see below). The PBMCs, which are primary cells with relatively small amounts of RNA (around 1pg RNA/cell), come from a healthy donor. There were 2,700 cells detected and sequencing was performed on an Illumina NextSeq 500 with around 69,000 reads per cell. To get started install Seurat.

# install if necessary
install.packages("devtools")
library(devtools)

# I'm using R version >= 3.4 on a Mac
install_url("https://github.com/satijalab/seurat/releases/download/v2.0.0/Seurat_2.0.0_R3.4.tgz",
            binary = TRUE)

To follow the tutorial, you need the 10X data.

wget -c https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
ls -1 filtered_gene_bc_matrices/hg19/
barcodes.tsv
genes.tsv
matrix.mtx

# matrix.mtx is a MatrixMarket file; see http://math.nist.gov/MatrixMarket/formats.html
# only non-zero entries are stored in the file
# comments start with a %, like LaTeX
# the first line indicates the total number of rows, columns, and entries
# the following lines provide a row and column number and the value at that coordinate
head filtered_gene_bc_matrices/hg19/matrix.mtx 
%%MatrixMarket matrix coordinate real general
%
32738 2700 2286884
32709 1 4
32707 1 1
32706 1 10
32704 1 1
32703 1 5
32702 1 6
32700 1 10

Load required libraries and data.

library(Seurat)
library(dplyr)
library(Matrix)

# change this to your working directory
setwd("~/muse/seurat")
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

# see https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgTMatrix-class.html
class(pbmc.data)
[1] "dgTMatrix"
attr(,"package")
[1] "Matrix"

# 32,738 genes and 2,700 single cells
dim(pbmc.data)
[1] 32738  2700

# check out the first six genes and cells
pbmc.data[1:6, 1:6]
6 x 6 sparse Matrix of class "dgTMatrix"
             AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG AAACGCACTGGTAC
MIR1302-10                .              .              .              .              .              .
FAM138A                   .              .              .              .              .              .
OR4F5                     .              .              .              .              .              .
RP11-34P13.7              .              .              .              .              .              .
RP11-34P13.8              .              .              .              .              .              .
AL627309.1                .              .              .              .              .              .

# summary of total expression per single cell
summary(colSums(pbmc.data))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    548    1758    2197    2367    2763   15844

# check how many genes have at least one transcript in each cell
at_least_one <- apply(pbmc.data, 2, function(x) sum(x>0))
hist(at_least_one, breaks = 100,
     main = "Distribution of detected genes",
     xlab = "Genes with at least one tag")

hist(colSums(pbmc.data),
     breaks = 100, main = "Expression sum per cell",
     xlab = "Sum expression")

The median number of detected genes among the single cells is 817.

The median sum of expression among the single cells is 2,197. This distribution is very similar to the distribution of detected genes shown above.

We will filter out genes and single cells before we continue with the analysis. The tutorial has arbitrary values of keeping genes expressed in three or more cells and keeping cells with at least 200 detected genes.

# manually check the number of genes detected in three or more cells
# a lot of genes are not detected in 3 or more cells
tmp <- apply(pbmc.data, 1, function(x) sum(x>0))
table(tmp>=3)

FALSE  TRUE 
19024 13714

# all cells have at least 200 detected genes
keep <- tmp>=3
tmp <- pbmc.data[keep,]
at_least_one <- apply(tmp, 2, function(x) sum(x>0))
summary(at_least_one)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  212.0   690.0   816.0   845.5   952.0  3400.0 

dim(tmp)
[1] 13714  2700

pbmc <- CreateSeuratObject(raw.data = pbmc.data,
                           min.cells = 3,
                           min.genes = 200,
                           project = "10X_PBMC")

# see ?seurat for more information on the class
class(pbmc)
[1] "seurat"
attr(,"package")
[1] "Seurat"

# same numbers as above 
pbmc
An object of class seurat in project 10X_PBMC 
 13714 genes across 2700 samples.

slotNames(pbmc)
 [1] "raw.data"     "data"         "scale.data"   "var.genes"    "is.expr"      "ident"        "meta.data"   
 [8] "project.name" "dr"           "assay"        "hvg.info"     "imputed"      "cell.names"   "cluster.tree"
[15] "snn"          "calc.params"  "kmeans"       "spatial"      "misc"         "version"

The tutorial states that "The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat." The nUMI is calculated as num.mol <- colSums(object.raw.data), i.e. each transcript is a unique molecule. The number of genes is simply the tally of genes with at least 1 transcript; num.genes <- colSums(object.raw.data > is.expr) where is.expr is zero.

A common quality control metric is the percentage of transcripts from the mitochondrial genome. According to the paper "Classification of low quality cells from single-cell RNA-seq data" the reason this is a quality control metric is because if a single cell is lysed, cytoplasmic RNA will be lost apart from the RNA that is enclosed in the mitochondria, which will be retained and sequenced.

# mitochondria genes conveniently start with MT
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
length(mito.genes)
[1] 13
percent.mito <- colSums(pbmc@raw.data[mito.genes, ]) / colSums(pbmc@raw.data)

# check out the meta data
head(pbmc@meta.data)
               nGene nUMI orig.ident
AAACATACAACCAC   781 2421   10X_PBMC
AAACATTGAGCTAC  1352 4903   10X_PBMC
AAACATTGATCAGC  1131 3149   10X_PBMC
AAACCGTGCTTCCG   960 2639   10X_PBMC
AAACCGTGTATGCG   522  981   10X_PBMC
AAACGCACTGGTAC   782 2164   10X_PBMC

# add some more meta data
pbmc <- AddMetaData(object = pbmc,
                    metadata = percent.mito,
                    col.name = "percent.mito")

head(pbmc@meta.data)
               nGene nUMI orig.ident percent.mito
AAACATACAACCAC   781 2421   10X_PBMC  0.030177759
AAACATTGAGCTAC  1352 4903   10X_PBMC  0.037935958
AAACATTGATCAGC  1131 3149   10X_PBMC  0.008897363
AAACCGTGCTTCCG   960 2639   10X_PBMC  0.017430845
AAACCGTGTATGCG   522  981   10X_PBMC  0.012244898
AAACGCACTGGTAC   782 2164   10X_PBMC  0.016643551

# plot number of genes, UMIs, and % mitochondria
VlnPlot(object = pbmc,
        features.plot = c("nGene", "nUMI", "percent.mito"),
        nCol = 3)

A couple of cells have high mitochondrial percentage which may indicate lost of cytoplasmic RNA.

The GenePlot() function can be used to visualise gene-gene relationships as well as any columns in the seurat object. Below we use the plotting function to spot cells that have a high percentage of mitochondrial RNA and to plot the relationship between the number of unique molecules and the number of genes captured.

par(mfrow = c(1, 2))
GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "percent.mito", pch.use = '.')
GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "nGene", pch.use = '.')

Left: There are some clear outliers in mitochondrial RNA vs. the poly-A selected RNA. Right: The more unique molecules captured, the more genes that are probed.

Next we'll use the FilterCells() function to subset the pbmc object based on the number of genes detected in each cell and by the percent mitochondria. Two thresholds need are specified for each filter, a low and a high; -Inf and Inf are used if you only want to specify a lower or upper bound respectively. Cells that have less than 200 genes and more than 2,500 genes and over 5% mitochondrial content are filtered out.

# manual check; I already know all cells have >200 genes
table(pbmc@meta.data$percent.mito < 0.05 & pbmc@meta.data$nGene<2500)

FALSE  TRUE 
   62  2638

# perform the filtering using FilterCells()
pbmc <- FilterCells(object = pbmc,
                    subset.names = c("nGene", "percent.mito"),
                    low.thresholds = c(200, -Inf),
                    high.thresholds = c(2500, 0.05))

# 62 cells are filtered out; numbers consistent with above
pbmc
An object of class seurat in project 10X_PBMC 
 13714 genes across 2638 samples.

The next step is to normalise the data, so that each cell can be compared against each other. At the time of writing, the only normalisation method implemented in Seurat is by log normalisation. Gene expression measurements for each cell are normalised by its total expression, scaled by 10,000, and log-transformed.

hist(colSums(pbmc@data),
     breaks = 100,
     main = "Total expression before normalisation",
     xlab = "Sum of expression")

# currently there is only log-normalisation
# according to the documentation more methods to be included soon
pbmc <- NormalizeData(object = pbmc,
                      normalization.method = "LogNormalize",
                      scale.factor = 1e4)
[1] "Performing log-normalization"
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|

hist(colSums(pbmc@data),
     breaks = 100,
     main = "Total expression after normalisation",
     xlab = "Sum of expression")

Once the data is normalised, the next step is to find genes are vary between single cells; genes that are constant among all cells have no distinguishing power. The FindVariableGenes() function calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. I interpret that as take each gene, get the average expression and variance of the gene across the 2,638 cells, categorise genes into bins (default is 20) based on their expression and variance, and finally normalise the variance in each bin. This was the same approach in Macosko et al. and new methods for detecting genes with variable expression patterns will be implemented in Seurat soon (according to the tutorial). The parameters used below are typical settings for UMI data that is normalised to a total of 10,000 molecules and will identify around 2,000 variable genes. The tutorial recommends that users should explore the parameters themselves since each dataset is different.

# the variable genes slot is empty before the analysis
pbmc@var.genes
logical(0)

# refer to ?FindVariableGenes
pbmc <- FindVariableGenes(object = pbmc,
                          mean.function = ExpMean,
                          dispersion.function = LogVMR,
                          x.low.cutoff = 0.0125,
                          x.high.cutoff = 3,
                          y.cutoff = 0.5)
[1] "Calculating gene dispersion"
  |====================================================================================================| 100%

# vector of variable genes
head(pbmc@var.genes)
[1] "TNFRSF4"  "CPSF3L"   "ATAD3C"   "C1orf86"  "RER1"     "TNFRSF25"

# number of variable genes
length(pbmc@var.genes)
[1] 1838

# mean and variance of genes are stored pbmc@hvg.info
head(pbmc@hvg.info)
       gene.mean gene.dispersion gene.dispersion.scaled
PPBP   1.2550295        6.286001               5.518037
DOK3   0.2778953        5.838389               8.070503
NFE2L2 0.4518666        5.829069               4.649354
ARVCF  0.1327684        5.807067               8.008689
YPEL2  0.2479784        5.806287               8.007150
UBE2D4 0.2597063        5.779663               7.954609

Values for the x-axis are in pbmc@hvg.info$gene.mean and values for the y-axis are in pbmc@hvg.info$gene.dispersion.scaled.

Seurat constructs linear models to predict gene expression based on user-defined variables to help remove unwanted sources of variation. The idea is that confounding factors, e.g. batch effects and cell cycle stage, affect the observed gene expression patterns and one should adjust for these factors to infer the "correct" gene expression pattern. Buettner et al. demonstrate how correcting for confounding factors improved their downstream analyses.

The example provided in the tutorial used the number of detected molecules per cell and the percentage mitochondrial RNA to build a linear model. The scaled z-scored residuals, i.e. how much the actual expression differs from the linear model, are stored in the scale.data slot, which are used for dimensionality reduction and clustering.

# slot is empty before running ScaleData()
pbmc@scale.data
NULL

# build linear model using nUMI and percent.mito
pbmc <- ScaleData(object = pbmc,
                  vars.to.regress = c("nUMI", "percent.mito"))
[1] "Regressing out nUMI"         "Regressing out percent.mito"
  |====================================================================================================| 100%
[1] "Scaling data matrix"
  |====================================================================================================| 100%

class(pbmc@scale.data)
[1] "matrix"

pbmc@scale.data[1:6, 1:6]
              AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG AAACGCACTGGTAC
AL627309.1       -0.06547546    -0.10052277    -0.05804007    -0.05845502    -0.03504052   -0.052272255
AP006222.2       -0.02690776    -0.02820169    -0.04508318    -0.03716615    -0.03630436   -0.036391616
RP11-206L10.2    -0.03596234    -0.17689415    -0.09997719    -0.06126849     0.03287205   -0.033625801
RP11-206L10.9    -0.07238741    -0.27313853    -0.03893362    -0.03738245     0.09673455   -0.001697051
LINC00115        -0.08375505    -0.11754780    -0.09141015    -0.08543668    -0.06286247   -0.079036313
NOC2L            -0.31556732    -0.44328035    -0.36509400    -0.33366355    -0.24835175   -0.308864815

The dimension of our dataset is 13714 by 2638; however many genes and single cells are not interesting because they don't vary a lot. One goal of Principal Component Analysis (PCA) is to find the direction/s (usually the first two principal components) in which there is the most variance. The RunPCA() function performs the PCA on genes in the @var.genes slot by default and this can be changed using the pc.genes parameter.

pbmc <- RunPCA(object = pbmc,
               pc.genes = pbmc@var.genes,
               do.print = TRUE,
               pcs.print = 1:5,
               genes.print = 5)
# output not shown here

PrintPCAParams(pbmc)
Parameters used in latest PCA calculation run on: 2017-08-01 09:25:59
=============================================================================
PCs computed    Genes used in calculation    PCs Scaled by Variance Explained
    20                  1838                               TRUE
-----------------------------------------------------------------------------
rev.pca 
 FALSE
-----------------------------------------------------------------------------
Full gene list can be accessed using 
 GetCalcParam(object = object, calculation = "RunPCA", parameter = "pc.genes")

The PrintPCA() function outputs a set of genes that most strongly define a set of principal components.

PrintPCA(object = pbmc, pcs.print = 1:2, genes.print = 5, use.full = FALSE)
[1] "PC1"
[1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"   
[1] ""
[1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"  
[1] ""
[1] ""
[1] "PC2"
[1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"
[1] ""
[1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"
[1] ""

Furthermore, Seurat has various functions for visualising the cells and genes that define the principal components.

# visualise top genes associated with principal components
VizPCA(object = pbmc, pcs.use = 1:2)

The PCAPlot() function plots the principal components from a PCA; cells are coloured by their identity class according to pbmc@ident.

PCAPlot(object = pbmc, dim.1 = 1, dim.2 = 2)

However, the PCA was only performed on the most variable genes, which is a subset of the dataset. The ProjectPCA step scores each gene in the dataset based on their correlation with the calculated components. This is useful because there may be genes that were not detected as variable genes in the variable gene selection step, which are still strongly correlated with cellular heterogeneity.

# the results of the projected PCA can be explored by setting use.full=TRUE in the functions above
pbmc <- ProjectPCA(object = pbmc, do.print = FALSE)

The PCHeatmap() function produces a heatmap based on the PCA; by default the function uses the first principal component and plots 30 genes across the number of cells specified in cells.use. Setting cells.use to a number plots the "extreme" cells on both ends of the spectrum, which dramatically speeds plotting for large datasets.

PCHeatmap(object = pbmc,
          pc.use = 1,
          cells.use = 500,
          do.balanced = TRUE,
          label.columns = FALSE)

# these warnings can be ignored
Warning messages:
1: In heatmap.2(data.use, Rowv = NA, Colv = NA, trace = "none", col = col.use,  :
  Discrepancy: Rowv is FALSE, while dendrogram is `both'. Omitting row dendogram.
2: In heatmap.2(data.use, Rowv = NA, Colv = NA, trace = "none", col = col.use,  :
  Discrepancy: Colv is FALSE, while dendrogram is `column'. Omitting column dendogram.
3: In plot.window(...) : "dimTitle" is not a graphical parameter
4: In plot.xy(xy, type, ...) : "dimTitle" is not a graphical parameter
5: In title(...) : "dimTitle" is not a graphical parameter

Plotting all cells.

PCHeatmap(object = pbmc,
          pc.use = 1,
          do.balanced = TRUE,
          label.columns = FALSE)

Plotting on 12 principal components.

PCHeatmap(object = pbmc,
          pc.use = 1:12,
          cells.use = 500,
          do.balanced = TRUE,
          label.columns = FALSE)
There were 12 warnings (use warnings() to see them)

The next steps are to determine how many principal components to use in downstream analyses, which is an important step for Seurat. The tutorial goes through two methods: one uses a statistical test based on a random null model, which is time-consuming for large datasets due to the resampling and may not return a clear cutoff and the other is a commonly used heuristic. The statistical test is carried out by the JackStraw() function, which randomly permutes a subset of data, and calculates projected PCA scores for these "random" genes. Then compares the PCA scores for the "random" genes with the observed PCA scores to determine statistical significance. End result is a p-value for each gene's association with each principal component. This resampling test was inspired by the jackstraw procedure. Significant principal components are those with a strong enrichment of low p-value genes.

# NOTE: This process can take a long time for big datasets, comment out for expediency.
# More approximate techniques such as those implemented in PCElbowPlot() can be used to reduce computation time
system.time(
pbmc <- JackStraw(object = pbmc,
                  num.replicate = 100,
                  do.print = FALSE)
)

   user  system elapsed 
248.103  11.050 259.536

The JackStrawPlot() function provides a visualisation tool for comparing the distribution of p-values for each principal component with a uniform distribution (dashed line). "Significant" principal components will show a strong enrichment of genes with low p-values (solid curve above the dashed line).

JackStrawPlot(object = pbmc, PCs = 1:12)
Warning message:
Removed 17469 rows containing missing values (geom_point).

In this case, it appears that principal components 1 to 10 are significant.

Another approach for deciding how many principal components to use is to examine the standard deviations of the principle components, which is performed by the PCElbowPlot() function. A cutoff can be drawn where there is a clear elbow in the graph.

PCElbowPlot(object = pbmc)

It looks like an elbow would fall around principle component 9.

Seurat includes a graph-based clustering approach, which is quite technical. The approach was heavily inspired by recent work that applied graph-based clustering approaches to scRNA-seq data, namely SNN-Cliq and PhenoGraph. Below is the technical description of the approach:

Briefly, these methods embed cells in a graph structure (e.g. a K-nearest neighbour (KNN) graph) with edges drawn between cells with similar gene expression patterns, and then attempt to partition this graph into highly interconnected "quasi-cliques" or "communities." As in PhenoGraph, we first construct a KNN graph based on the Euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighbourhoods (Jaccard distance). To cluster the cells, we apply modularity optimisation techniques SLM, to iteratively group cells together, with the goal of optimising the standard modularity function.

The FindClusters() function implements the procedure above. The resolution parameter adjusts the granularity of the clustering with higher values leading to more clusters, i.e. higher granularity. According to the authors of Seurat, setting resolution between 0.6 - 1.2 typically returns good results for datasets with around 3,000 cells. The clusters are saved in the @ident slot of the Seurat object.

# if you get the error
# No Java runtime present, requesting install.
# you need to download and install the Java SE Development Kit
# http://www.oracle.com/technetwork/java/javase/downloads/index.html

# save.SNN = TRUE saves the SNN so that the clustering algorithm can be rerun using the same graph
# but with a different resolution value see ?FindClusters
pbmc <- FindClusters(object = pbmc,
                     reduction.type = "pca",
                     dims.use = 1:10,
                     resolution = 0.6,
                     print.output = 0,
                     save.SNN = TRUE)

# use PrintFindClustersParams() to print summary
# of parameters used to FindClusters()
PrintFindClustersParams(object = pbmc)
Parameters used in latest FindClusters calculation run on: 2017-07-30 13:31:49
=============================================================================
Resolution: 0.6
-----------------------------------------------------------------------------
Modularity Function    Algorithm         n.start         n.iter
     1                   1                 100             10
-----------------------------------------------------------------------------
Reduction used          k.param          k.scale          prune.SNN
     pca                 30                25              0.0667
-----------------------------------------------------------------------------
Dims used in calculation
=============================================================================
1 2 3 4 5 6 7 8 9 10

Seurat can perform t-distributed Stochastic Neighbor Embedding (tSNE) via the RunTSNE() function. According to the authors, the results from the graph based clustering should be similar to the tSNE clustering. This is because the tSNE aims to place cells with similar local neighbourhoods in high-dimensional space together in low-dimensional space.

# the authors suggest using the same PCs as input to the clustering analysis
# although computing the tSNE based on scaled gene expression
# is also supported using the genes.use argument

pbmc <- RunTSNE(object = pbmc,
                dims.use = 1:10,
                do.fast = TRUE)

TSNEPlot(object = pbmc, do.label = TRUE)

The FindMarkers() function identifies positive and negative markers by comparing genes in cells of one cluster against genes in all other cells. The FindAllMarkers() function automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells. From the tutorial:

The min.pct argument requires a gene to be detected at a minimum percentage in either of the two groups of cells, and the thresh.test argument requires a gene to be differentially expressed (on average) by some amount between the two groups. You can set both of these to 0, but with a dramatic increase in time - since this will test a large number of genes that are unlikely to be highly discriminatory. As another option to speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than whatever this is set to. While there is generally going to be a loss in power, the speed increases can be significant and the most highly differentially expressed genes will likely still rise to the top.

# find all markers of cluster 1
cluster1.markers <- FindMarkers(object = pbmc,
                                ident.1 = 1,
                                min.pct = 0.25)
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 07s

head(cluster1.markers)
       p_val avg_diff pct.1 pct.2
S100A9     0 3.827593 0.996 0.216
S100A8     0 3.786535 0.973 0.123
LYZ        0 3.116197 1.000 0.517
LGALS2     0 2.634722 0.908 0.060
FCN1       0 2.369524 0.956 0.150
TYROBP     0 2.106174 0.994 0.266

# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(object = pbmc,
                                ident.1 = 5,
                                ident.2 = c(0,3),
                                min.pct = 0.25)
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s

head(cluster5.markers)
              p_val   avg_diff pct.1 pct.2
GNLY  3.466387e-171  3.5147178 0.961 0.143
RPS12 4.191523e-170 -1.0195287 0.994 1.000
NKG7  1.266864e-159  2.3523369 1.000 0.307
GZMB  8.416814e-158  3.1950212 0.955 0.084
RPS27 5.655551e-157 -0.9156105 0.987 0.999
RPL13 8.022900e-152 -0.8149199 1.000 1.000

# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(object = pbmc,
                               only.pos = TRUE,
                               min.pct = 0.25,
                               thresh.use = 0.25)
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 04s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 04s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 03s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 09s

head(pbmc.markers)
              p_val  avg_diff pct.1 pct.2 cluster  gene
LDHB  9.480623e-268 1.1490575 0.924 0.483       0  LDHB
CD3D  1.576099e-240 1.0468542 0.870 0.250       0  CD3D
RPSA  3.436777e-176 0.5154017 0.992 0.933       0  RPSA
RPS29 1.214926e-162 0.6124085 0.990 0.878       0 RPS29
IL32  2.683045e-161 0.6865265 0.836 0.333       0  IL32
LTB   5.232177e-160 0.8968131 0.932 0.528       0   LTB

pbmc.markers %>% group_by(cluster) %>% top_n(2, avg_diff)
# A tibble: 16 x 6
# Groups:   cluster [8]
           p_val avg_diff pct.1 pct.2 cluster     gene
           <dbl>    <dbl> <dbl> <dbl>  <fctr>    <chr>
 1 9.480623e-268 1.149058 0.924 0.483       0     LDHB
 2 3.519020e-133 1.068122 0.662 0.202       0     IL7R
 3  0.000000e+00 3.827593 0.996 0.216       1   S100A9
 4  0.000000e+00 3.786535 0.973 0.123       1   S100A8
 5  0.000000e+00 2.977399 0.936 0.042       2    CD79A
 6 1.030004e-191 2.492236 0.624 0.022       2    TCL1A
 7 3.158793e-231 2.158812 0.974 0.230       3     CCL5
 8 4.103013e-125 2.113428 0.588 0.050       3     GZMK
 9 1.304430e-165 2.151881 1.000 0.316       4     LST1
10 1.420369e-138 2.275509 0.962 0.137       4   FCGR3A
11 1.127318e-215 3.763928 0.961 0.131       5     GNLY
12 8.182907e-187 3.334634 0.955 0.068       5     GZMB
13  3.261141e-48 2.729243 0.844 0.011       6   FCER1A
14  7.655042e-30 1.965168 1.000 0.513       6 HLA-DPB1
15  1.165274e-56 5.889503 1.000 0.023       7     PPBP
16  3.238887e-44 4.952160 0.933 0.010       7      PF4

Seurat has four tests for differential expression (DE) which can be set with the test.use parameter in the FindMarkers() function:

  1. ROC test
  2. t-test
  3. LRT test based on zero-inflated data
  4. LRT test based on tobit-censoring models

Let's compare the four different DE methods for defining cluster 1.

levels(pbmc@ident)
[1] "0" "1" "2" "3" "4" "5" "6" "7"

table(pbmc@ident)

   0    1    2    3    4    5    6    7 
1149  478  343  308  158  155   32   15

my_bimod <- FindMarkers(object = pbmc,
                        ident.1 = 1,
                        thresh.use = 0.25,
                        test.use = "bimod",
                        only.pos = TRUE)

my_roc <- FindMarkers(object = pbmc,
                      ident.1 = 1,
                      thresh.use = 0.25,
                      test.use = "roc",
                      only.pos = TRUE)

my_t <- FindMarkers(object = pbmc,
                    ident.1 = 1,
                    thresh.use = 0.25,
                    test.use = "t",
                    only.pos = TRUE)

my_tobit <- FindMarkers(object = pbmc,
                        ident.1 = 1,
                        thresh.use = 0.25,
                        test.use = "tobit",
                        only.pos = TRUE)

# identical set of genes
dim(my_bimod)
[1] 570   4
dim(my_roc)
[1] 570   5
dim(my_t)
[1] 570   4
dim(my_tobit)
[1] 570   4

# the rankings of the genes are quite similar between the methods
my_gene <- row.names(my_bimod)
a <- 1:length(my_gene)
b <- match(my_gene, row.names(my_roc))
c <- match(my_gene, row.names(my_t))
d <- match(my_gene, row.names(my_tobit))

# bimod vs. bimod
cor(a, a, method = "spearman")
[1] 1

# bimod vs. roc
cor(a, b, method = "spearman")
[1] 0.880193

# bimod vs. t
cor(a, c, method = "spearman")
[1] 0.9575922

# bimod vs. tobit
cor(a, d, method = "spearman")
[1] 0.9759479

par(mfrow=c(2,2))
barplot(a, main = 'bimod')
barplot(b, main = 'roc')
barplot(c, main = 't')
barplot(d, main = 'tobit')

The VlnPlot() and FeaturePlot() functions can be used to visualise marker expression.

VlnPlot(object = pbmc, features.plot = c("MS4A1", "CD79A"))

# visualise markers found by the FindMarkers() analysis above
head(my_tobit)
       p_val avg_diff pct.1 pct.2
S100A9     0 3.827593 0.996 0.216
S100A8     0 3.786535 0.973 0.123
LYZ        0 3.116197 1.000 0.517
LGALS2     0 2.634722 0.908 0.060
FCN1       0 2.369524 0.956 0.150
FTL        0 1.804296 1.000 0.988

VlnPlot(object = pbmc, features.plot = c("S100A9", "S100A8"))

# you can plot raw UMI counts as well
VlnPlot(object = pbmc,
        features.plot = c("NKG7", "PF4"),
        use.raw = TRUE,
        y.log = TRUE)

FeaturePlot(object = pbmc,
            features.plot = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"),
            cols.use = c("grey", "blue"),
            reduction.use = "tsne")

FeaturePlot(object = pbmc,
            features.plot = head(row.names(my_tobit), 9),
            cols.use = c("grey", "blue"))

The DoHeatmap() function creates a heatmap of genes across all cells. Below, we use plot the top 10 marker genes for the eight clusters.

head(pbmc.markers)
              p_val  avg_diff pct.1 pct.2 cluster  gene
LDHB  9.480623e-268 1.1490575 0.924 0.483       0  LDHB
CD3D  1.576099e-240 1.0468542 0.870 0.250       0  CD3D
RPSA  3.436777e-176 0.5154017 0.992 0.933       0  RPSA
RPS29 1.214926e-162 0.6124085 0.990 0.878       0 RPS29
IL32  2.683045e-161 0.6865265 0.836 0.333       0  IL32
LTB   5.232177e-160 0.8968131 0.932 0.528       0   LTB

pbmc.markers %>%
  group_by(cluster) %>%
  top_n(10, avg_diff) -> top10

head(top10)
# A tibble: 6 x 6
# Groups:   cluster [1]
          p_val  avg_diff pct.1 pct.2 cluster  gene
          <dbl>     <dbl> <dbl> <dbl>  <fctr> <chr>
1 9.480623e-268 1.1490575 0.924 0.483       0  LDHB
2 1.576099e-240 1.0468542 0.870 0.250       0  CD3D
3 5.232177e-160 0.8968131 0.932 0.528       0   LTB
4 1.254598e-154 0.9722549 0.768 0.267       0  CD3E
5 3.519020e-133 1.0681216 0.662 0.202       0  IL7R
6 6.578712e-106 0.8980251 0.649 0.259       0 NOSIP

DoHeatmap(object = pbmc,
          genes.use = top10$gene,
          order.by.ident = TRUE,
          slim.col.label = TRUE,
          remove.key = TRUE)

There are canonical markers for each cell type, which can be used to assess the clustering.

Cluster ID Markers Cell Type
0 IL7R CD4 T cells
1 CD14, LYZ CD14+ Monocytes
2 MS4A1 B cells
3 CD8A CD8 T cells
4 FCGR3A, MS4A7 FCGR3A+ Monocytes
5 GNLY, NKG7 NK cells
6 FCER1A, CST3 Dendritic Cells
7 PPBP Megakaryocytes
FeaturePlot(object = pbmc,
            features.plot = c("IL7R", "CD14", "LYZ", "MS4A1", "CD8A", "FCGR3A", "MS4A7", "GNLY", "NKG7", "FCER1A", "CST3", "PPBP"),
            cols.use = c("grey", "blue"),
            reduction.use = "tsne")

current.cluster.ids <- c(0, 1, 2, 3, 4, 5, 6, 7)
new.cluster.ids <- c("CD4 T cells",
                     "CD14+ Monocytes",
                     "B cells",
                     "CD8 T cells",
                     "FCGR3A+ Monocytes",
                     "NK cells",
                     "Dendritic cells",
                     "Megakaryocytes")

pbmc@ident <- plyr::mapvalues(x = pbmc@ident,
                              from = current.cluster.ids,
                              to = new.cluster.ids)

TSNEPlot(object = pbmc, do.label = TRUE, pt.size = 0.5)

Seurat provides the StashIdent() function for keeping cluster IDs; this is useful for testing various parameters and comparing the clusters. For example, adjusting the parameters may lead to the CD4 T cells subdividing into two groups.

# stash cluster identities for later
pbmc <- StashIdent(object = pbmc, save.name = "ClusterNames_0.6")

pbmc <- FindClusters(object = pbmc,
                     reduction.type = "pca",
                     dims.use = 1:10,
                     resolution = 0.8,
                     print.output = FALSE)

# plot two tSNE plots side by side, and colour points based on different criteria
plot1 <- TSNEPlot(object = pbmc,
                  do.return = TRUE,
                  no.legend = TRUE,
                  do.label = TRUE)

plot2 <- TSNEPlot(object = pbmc,
                  do.return = TRUE,
                  group.by = "ClusterNames_0.6",
                  no.legend = TRUE,
                  do.label = TRUE)

plot_grid(plot1, plot2)

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.