Getting started with Seurat

This post is outdated; please refer to the official Seurat vignettes for more information.

This post follows the Peripheral Blood Mononuclear Cells (PBMCs) tutorial for 2,700 single cells. It was written while I was going through the tutorial and contains my notes. The dataset for this tutorial can be downloaded from the 10X Genomics dataset page but it is also hosted on Amazon (see below). The PBMCs, which are primary cells with relatively small amounts of RNA (around 1pg RNA/cell), come from a healthy donor. There were 2,700 cells detected and sequencing was performed on an Illumina NextSeq 500 with around 69,000 reads per cell. To get started install Seurat by using install.packages().


To follow the tutorial, you need the 10X data.

wget -c

tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
ls -1 filtered_gene_bc_matrices/hg19/

# matrix.mtx is a MatrixMarket file; see
# only non-zero entries are stored in the file
# comments start with a %, like LaTeX
# the first line indicates the total number of rows, columns, and entries
# the following lines provide a row and column number and the value at that coordinate
head filtered_gene_bc_matrices/hg19/matrix.mtx 
%%MatrixMarket matrix coordinate real general
32738 2700 2286884
32709 1 4
32707 1 1
32706 1 10
32704 1 1
32703 1 5
32702 1 6
32700 1 10

Load required libraries and data.


# change this to your working directory
setwd("~/muse/seurat") <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

# see
[1] "dgTMatrix"
[1] "Matrix"

# 32,738 genes and 2,700 single cells
[1] 32738  2700

# check out the first six genes and cells[1:6, 1:6]
6 x 6 sparse Matrix of class "dgTMatrix"
MIR1302-10                .              .              .              .              .              .
FAM138A                   .              .              .              .              .              .
OR4F5                     .              .              .              .              .              .
RP11-34P13.7              .              .              .              .              .              .
RP11-34P13.8              .              .              .              .              .              .
AL627309.1                .              .              .              .              .              .

# summary of total expression per single cell
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    548    1758    2197    2367    2763   15844

# check how many genes have at least one transcript in each cell
at_least_one <- apply(, 2, function(x) sum(x>0))
hist(at_least_one, breaks = 100,
     main = "Distribution of detected genes",
     xlab = "Genes with at least one tag")

     breaks = 100, main = "Expression sum per cell",
     xlab = "Sum expression")

The median number of detected genes among the single cells is 817.

The median sum of expression among the single cells is 2,197. This distribution is very similar to the distribution of detected genes shown above.

We will filter out genes and single cells before we continue with the analysis. The tutorial has arbitrary values of keeping genes expressed in three or more cells and keeping cells with at least 200 detected genes.

# manually check the number of genes detected in three or more cells
# a lot of genes are not detected in 3 or more cells
tmp <- apply(, 1, function(x) sum(x>0))

19024 13714

# all cells have at least 200 detected genes
keep <- tmp>=3
tmp <-[keep,]
at_least_one <- apply(tmp, 2, function(x) sum(x>0))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  212.0   690.0   816.0   845.5   952.0  3400.0 

[1] 13714  2700

pbmc <- CreateSeuratObject( =,
                           min.cells = 3,
                           min.genes = 200,
                           project = "10X_PBMC")

# see ?seurat for more information on the class
[1] "seurat"
[1] "Seurat"

# same numbers as above 
An object of class seurat in project 10X_PBMC 
 13714 genes across 2700 samples.

 [1] ""     "data"         ""   "var.genes"    "is.expr"      "ident"        ""   
 [8] "" "dr"           "assay"        ""     "imputed"      "cell.names"   "cluster.tree"
[15] "snn"          "calc.params"  "kmeans"       "spatial"      "misc"         "version"

The tutorial states that "The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat." The nUMI is calculated as num.mol <- colSums(, i.e. each transcript is a unique molecule. The number of genes is simply the tally of genes with at least 1 transcript; num.genes <- colSums( > is.expr) where is.expr is zero.

A common quality control metric is the percentage of transcripts from the mitochondrial genome. According to the paper "Classification of low quality cells from single-cell RNA-seq data" the reason this is a quality control metric is because if a single cell is lysed, cytoplasmic RNA will be lost apart from the RNA that is enclosed in the mitochondria, which will be retained and sequenced.

# mitochondria genes conveniently start with MT
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
[1] 13
percent.mito <- Matrix::colSums([mito.genes, ]) / Matrix::colSums(

# check out the meta data
               nGene nUMI orig.ident

# add some more meta data
pbmc <- AddMetaData(object = pbmc,
                    metadata = percent.mito,
           = "percent.mito")

               nGene nUMI orig.ident percent.mito
AAACATACAACCAC   781 2421   10X_PBMC  0.030177759
AAACATTGAGCTAC  1352 4903   10X_PBMC  0.037935958
AAACATTGATCAGC  1131 3149   10X_PBMC  0.008897363
AAACCGTGCTTCCG   960 2639   10X_PBMC  0.017430845
AAACCGTGTATGCG   522  981   10X_PBMC  0.012244898
AAACGCACTGGTAC   782 2164   10X_PBMC  0.016643551

# plot number of genes, UMIs, and % mitochondria
VlnPlot(object = pbmc,
        features.plot = c("nGene", "nUMI", "percent.mito"),
        nCol = 3)

A couple of cells have high mitochondrial percentage which may indicate lost of cytoplasmic RNA.

The GenePlot() function can be used to visualise gene-gene relationships as well as any columns in the seurat object. Below we use the plotting function to spot cells that have a high percentage of mitochondrial RNA and to plot the relationship between the number of unique molecules and the number of genes captured.

par(mfrow = c(1, 2))
GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "percent.mito", pch.use = '.')
GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "nGene", pch.use = '.')

Left: There are some clear outliers in mitochondrial RNA vs. the poly-A selected RNA. Right: The more unique molecules captured, the more genes that are probed.

Next we'll use the FilterCells() function to subset the pbmc object based on the number of genes detected in each cell and by the percent mitochondria. Two thresholds need are specified for each filter, a low and a high; -Inf and Inf are used if you only want to specify a lower or upper bound respectively. Inf is an abbreviation of infinity; we use it to set the highest (Inf) or lowest possible (-Inf) bounds. Cells that have less than 200 genes and more than 2,500 genes and over 5% mitochondrial content are filtered out.

# manual check; I already know all cells have >200 genes
table($percent.mito < 0.05 &$nGene<2500)

   62  2638

# perform the filtering using FilterCells()
pbmc <- FilterCells(object = pbmc,
                    subset.names = c("nGene", "percent.mito"),
                    low.thresholds = c(200, -Inf),
                    high.thresholds = c(2500, 0.05))

# 62 cells are filtered out; numbers consistent with above
An object of class seurat in project 10X_PBMC 
 13714 genes across 2638 samples.

The next step is to normalise the data, so that each cell can be compared against each other. At the time of writing, the only normalisation method implemented in Seurat is by log normalisation. Gene expression measurements for each cell are normalised by its total expression, scaled by 10,000, and log-transformed.

     breaks = 100,
     main = "Total expression before normalisation",
     xlab = "Sum of expression")

# currently there is only log-normalisation
# according to the documentation more methods to be included soon
pbmc <- NormalizeData(object = pbmc,
                      normalization.method = "LogNormalize",
                      scale.factor = 1e4)
[1] "Performing log-normalization"
0%   10   20   30   40   50   60   70   80   90   100%

     breaks = 100,
     main = "Total expression after normalisation",
     xlab = "Sum of expression")

Once the data is normalised, the next step is to find genes are vary between single cells; genes that are constant among all cells have no distinguishing power. The FindVariableGenes() function calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. I interpret that as take each gene, get the average expression and variance of the gene across the 2,638 cells, categorise genes into bins (default is 20) based on their expression and variance, and finally normalise the variance in each bin. This was the same approach in Macosko et al. and new methods for detecting genes with variable expression patterns will be implemented in Seurat soon (according to the tutorial). The parameters used below are typical settings for UMI data that is normalised to a total of 10,000 molecules and will identify around 2,000 variable genes. The tutorial recommends that users should explore the parameters themselves since each dataset is different.

# the variable genes slot is empty before the analysis

# refer to ?FindVariableGenes
pbmc <- FindVariableGenes(object = pbmc,
                          mean.function = ExpMean,
                          dispersion.function = LogVMR,
                          x.low.cutoff = 0.0125,
                          x.high.cutoff = 3,
                          y.cutoff = 0.5)
[1] "Calculating gene dispersion"
  |====================================================================================================| 100%

# vector of variable genes
[1] "TNFRSF4"  "CPSF3L"   "ATAD3C"   "C1orf86"  "RER1"     "TNFRSF25"

# number of variable genes
[1] 1838

# mean and variance of genes are stored
       gene.mean gene.dispersion gene.dispersion.scaled
PPBP   1.2550295        6.286001               5.518037
DOK3   0.2778953        5.838389               8.070503
NFE2L2 0.4518666        5.829069               4.649354
ARVCF  0.1327684        5.807067               8.008689
YPEL2  0.2479784        5.806287               8.007150
UBE2D4 0.2597063        5.779663               7.954609

Values for the x-axis are in$gene.mean and values for the y-axis are in$gene.dispersion.scaled.

Seurat constructs linear models to predict gene expression based on user-defined variables to help remove unwanted sources of variation. The idea is that confounding factors, e.g. batch effects and cell cycle stage, affect the observed gene expression patterns and one should adjust for these factors to infer the "correct" gene expression pattern. Buettner et al. demonstrate how correcting for confounding factors improved their downstream analyses.

The example provided in the tutorial used the number of detected molecules per cell and the percentage mitochondrial RNA to build a linear model. The scaled z-scored residuals, i.e. how much the actual expression differs from the linear model, are stored in the slot, which are used for dimensionality reduction and clustering.

# slot is empty before running ScaleData()

# build linear model using nUMI and percent.mito
pbmc <- ScaleData(object = pbmc,
         = c("nUMI", "percent.mito"))
[1] "Regressing out nUMI"         "Regressing out percent.mito"
  |====================================================================================================| 100%
[1] "Scaling data matrix"
  |====================================================================================================| 100%

[1] "matrix"[1:6, 1:6]
AL627309.1       -0.06547546    -0.10052277    -0.05804007    -0.05845502    -0.03504052   -0.052272255
AP006222.2       -0.02690776    -0.02820169    -0.04508318    -0.03716615    -0.03630436   -0.036391616
RP11-206L10.2    -0.03596234    -0.17689415    -0.09997719    -0.06126849     0.03287205   -0.033625801
RP11-206L10.9    -0.07238741    -0.27313853    -0.03893362    -0.03738245     0.09673455   -0.001697051
LINC00115        -0.08375505    -0.11754780    -0.09141015    -0.08543668    -0.06286247   -0.079036313
NOC2L            -0.31556732    -0.44328035    -0.36509400    -0.33366355    -0.24835175   -0.308864815

The dimension of our dataset is 13714 by 2638; however many genes and single cells are not interesting because they don't vary a lot. One goal of Principal Component Analysis (PCA) is to find the direction/s (usually the first two principal components) in which there is the most variance. The RunPCA() function performs the PCA on genes in the @var.genes slot by default and this can be changed using the pc.genes parameter.

pbmc <- RunPCA(object = pbmc,
               pc.genes = pbmc@var.genes,
               do.print = TRUE,
               pcs.print = 1:5,
               genes.print = 5)
# output not shown here

Parameters used in latest PCA calculation run on: 2017-08-01 09:25:59
PCs computed    Genes used in calculation    PCs Scaled by Variance Explained
    20                  1838                               TRUE
Full gene list can be accessed using 
 GetCalcParam(object = object, calculation = "RunPCA", parameter = "pc.genes")

The PrintPCA() function outputs a set of genes that most strongly define a set of principal components.

PrintPCA(object = pbmc, pcs.print = 1:2, genes.print = 5, use.full = FALSE)
[1] "PC1"
[1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"   
[1] ""
[1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"  
[1] ""
[1] ""
[1] "PC2"
[1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"
[1] ""
[1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"
[1] ""

Furthermore, Seurat has various functions for visualising the cells and genes that define the principal components.

# visualise top genes associated with principal components
VizPCA(object = pbmc, pcs.use = 1:2)

The PCAPlot() function plots the principal components from a PCA; cells are coloured by their identity class according to pbmc@ident.

PCAPlot(object = pbmc, dim.1 = 1, dim.2 = 2)

However, the PCA was only performed on the most variable genes, which is a subset of the dataset. The ProjectPCA step scores each gene in the dataset based on their correlation with the calculated components. This is useful because there may be genes that were not detected as variable genes in the variable gene selection step, which are still strongly correlated with cellular heterogeneity.

# the results of the projected PCA can be explored by setting use.full=TRUE in the functions above
pbmc <- ProjectPCA(object = pbmc, do.print = FALSE)

The PCHeatmap() function produces a heatmap based on the PCA; by default the function uses the first principal component and plots 30 genes across the number of cells specified in cells.use. Setting cells.use to a number plots the "extreme" cells on both ends of the spectrum, which dramatically speeds plotting for large datasets.

PCHeatmap(object = pbmc,
          pc.use = 1,
          cells.use = 500,
          do.balanced = TRUE,
          label.columns = FALSE)

# these warnings can be ignored
Warning messages:
1: In heatmap.2(data.use, Rowv = NA, Colv = NA, trace = "none", col = col.use,  :
  Discrepancy: Rowv is FALSE, while dendrogram is `both'. Omitting row dendogram.
2: In heatmap.2(data.use, Rowv = NA, Colv = NA, trace = "none", col = col.use,  :
  Discrepancy: Colv is FALSE, while dendrogram is `column'. Omitting column dendogram.
3: In plot.window(...) : "dimTitle" is not a graphical parameter
4: In plot.xy(xy, type, ...) : "dimTitle" is not a graphical parameter
5: In title(...) : "dimTitle" is not a graphical parameter

Plotting all cells.

PCHeatmap(object = pbmc,
          pc.use = 1,
          do.balanced = TRUE,
          label.columns = FALSE)

Plotting on 12 principal components.

PCHeatmap(object = pbmc,
          pc.use = 1:12,
          cells.use = 500,
          do.balanced = TRUE,
          label.columns = FALSE)
There were 12 warnings (use warnings() to see them)

The next steps are to determine how many principal components to use in downstream analyses, which is an important step for Seurat. The tutorial goes through two methods: one uses a statistical test based on a random null model, which is time-consuming for large datasets due to the resampling and may not return a clear cutoff and the other is a commonly used heuristic. The statistical test is carried out by the JackStraw() function, which randomly permutes a subset of data, and calculates projected PCA scores for these "random" genes. Then compares the PCA scores for the "random" genes with the observed PCA scores to determine statistical significance. End result is a p-value for each gene's association with each principal component. This resampling test was inspired by the jackstraw procedure. Significant principal components are those with a strong enrichment of low p-value genes.

# NOTE: This process can take a long time for big datasets, comment out for expediency.
# More approximate techniques such as those implemented in PCElbowPlot() can be used to reduce computation time
pbmc <- JackStraw(object = pbmc,
                  num.replicate = 100,
                  do.print = FALSE)

   user  system elapsed 
248.103  11.050 259.536

The JackStrawPlot() function provides a visualisation tool for comparing the distribution of p-values for each principal component with a uniform distribution (dashed line). "Significant" principal components will show a strong enrichment of genes with low p-values (solid curve above the dashed line).

JackStrawPlot(object = pbmc, PCs = 1:12)
Warning message:
Removed 17469 rows containing missing values (geom_point).

In this case, it appears that principal components 1 to 10 are significant.

Another approach for deciding how many principal components to use is to examine the standard deviations of the principle components, which is performed by the PCElbowPlot() function. A cutoff can be drawn where there is a clear elbow in the graph.

PCElbowPlot(object = pbmc)

It looks like an elbow would fall around principle component 9.

Seurat includes a graph-based clustering approach, which is quite technical. The approach was heavily inspired by recent work that applied graph-based clustering approaches to scRNA-seq data, namely SNN-Cliq and PhenoGraph. Below is the technical description of the approach:

Briefly, these methods embed cells in a graph structure (e.g. a K-nearest neighbour (KNN) graph) with edges drawn between cells with similar gene expression patterns, and then attempt to partition this graph into highly interconnected "quasi-cliques" or "communities." As in PhenoGraph, we first construct a KNN graph based on the Euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighbourhoods (Jaccard distance). To cluster the cells, we apply modularity optimisation techniques SLM, to iteratively group cells together, with the goal of optimising the standard modularity function.

The FindClusters() function implements the procedure above. The resolution parameter adjusts the granularity of the clustering with higher values leading to more clusters, i.e. higher granularity. According to the authors of Seurat, setting resolution between 0.6 - 1.2 typically returns good results for datasets with around 3,000 cells. The clusters are saved in the @ident slot of the Seurat object.

# if you get the error
# No Java runtime present, requesting install.
# you need to download and install the Java SE Development Kit

# save.SNN = TRUE saves the SNN so that the clustering algorithm can be rerun using the same graph
# but with a different resolution value see ?FindClusters
pbmc <- FindClusters(object = pbmc,
                     reduction.type = "pca",
                     dims.use = 1:10,
                     resolution = 0.6,
                     print.output = 0,
                     save.SNN = TRUE)

# use PrintFindClustersParams() to print summary
# of parameters used to FindClusters()
PrintFindClustersParams(object = pbmc)
Parameters used in latest FindClusters calculation run on: 2017-07-30 13:31:49
Resolution: 0.6
Modularity Function    Algorithm         n.start         n.iter
     1                   1                 100             10
Reduction used          k.param          k.scale          prune.SNN
     pca                 30                25              0.0667
Dims used in calculation
1 2 3 4 5 6 7 8 9 10

Seurat can perform t-distributed Stochastic Neighbor Embedding (tSNE) via the RunTSNE() function. According to the authors, the results from the graph based clustering should be similar to the tSNE clustering. This is because the tSNE aims to place cells with similar local neighbourhoods in high-dimensional space together in low-dimensional space.

# the authors suggest using the same PCs as input to the clustering analysis
# although computing the tSNE based on scaled gene expression
# is also supported using the genes.use argument

pbmc <- RunTSNE(object = pbmc,
                dims.use = 1:10,
       = TRUE)

TSNEPlot(object = pbmc, do.label = TRUE)

The FindMarkers() function identifies positive and negative markers by comparing genes in cells of one cluster against genes in all other cells. The FindAllMarkers() function automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells. From the tutorial:

The min.pct argument requires a gene to be detected at a minimum percentage in either of the two groups of cells, and the thresh.test argument requires a gene to be differentially expressed (on average) by some amount between the two groups. You can set both of these to 0, but with a dramatic increase in time - since this will test a large number of genes that are unlikely to be highly discriminatory. As another option to speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than whatever this is set to. While there is generally going to be a loss in power, the speed increases can be significant and the most highly differentially expressed genes will likely still rise to the top.

# find all markers of cluster 1
cluster1.markers <- FindMarkers(object = pbmc,
                                ident.1 = 1,
                                min.pct = 0.25)
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 07s

       p_val avg_diff pct.1 pct.2
S100A9     0 3.827593 0.996 0.216
S100A8     0 3.786535 0.973 0.123
LYZ        0 3.116197 1.000 0.517
LGALS2     0 2.634722 0.908 0.060
FCN1       0 2.369524 0.956 0.150
TYROBP     0 2.106174 0.994 0.266

# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(object = pbmc,
                                ident.1 = 5,
                                ident.2 = c(0,3),
                                min.pct = 0.25)
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s

              p_val   avg_diff pct.1 pct.2
GNLY  3.466387e-171  3.5147178 0.961 0.143
RPS12 4.191523e-170 -1.0195287 0.994 1.000
NKG7  1.266864e-159  2.3523369 1.000 0.307
GZMB  8.416814e-158  3.1950212 0.955 0.084
RPS27 5.655551e-157 -0.9156105 0.987 0.999
RPL13 8.022900e-152 -0.8149199 1.000 1.000

# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(object = pbmc,
                               only.pos = TRUE,
                               min.pct = 0.25,
                               thresh.use = 0.25)
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 04s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 04s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 03s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 09s

              p_val  avg_diff pct.1 pct.2 cluster  gene
LDHB  9.480623e-268 1.1490575 0.924 0.483       0  LDHB
CD3D  1.576099e-240 1.0468542 0.870 0.250       0  CD3D
RPSA  3.436777e-176 0.5154017 0.992 0.933       0  RPSA
RPS29 1.214926e-162 0.6124085 0.990 0.878       0 RPS29
IL32  2.683045e-161 0.6865265 0.836 0.333       0  IL32
LTB   5.232177e-160 0.8968131 0.932 0.528       0   LTB

pbmc.markers %>% group_by(cluster) %>% top_n(2, avg_diff)
# A tibble: 16 x 6
# Groups:   cluster [8]
           p_val avg_diff pct.1 pct.2 cluster     gene
           <dbl>    <dbl> <dbl> <dbl>  <fctr>    <chr>
 1 9.480623e-268 1.149058 0.924 0.483       0     LDHB
 2 3.519020e-133 1.068122 0.662 0.202       0     IL7R
 3  0.000000e+00 3.827593 0.996 0.216       1   S100A9
 4  0.000000e+00 3.786535 0.973 0.123       1   S100A8
 5  0.000000e+00 2.977399 0.936 0.042       2    CD79A
 6 1.030004e-191 2.492236 0.624 0.022       2    TCL1A
 7 3.158793e-231 2.158812 0.974 0.230       3     CCL5
 8 4.103013e-125 2.113428 0.588 0.050       3     GZMK
 9 1.304430e-165 2.151881 1.000 0.316       4     LST1
10 1.420369e-138 2.275509 0.962 0.137       4   FCGR3A
11 1.127318e-215 3.763928 0.961 0.131       5     GNLY
12 8.182907e-187 3.334634 0.955 0.068       5     GZMB
13  3.261141e-48 2.729243 0.844 0.011       6   FCER1A
14  7.655042e-30 1.965168 1.000 0.513       6 HLA-DPB1
15  1.165274e-56 5.889503 1.000 0.023       7     PPBP
16  3.238887e-44 4.952160 0.933 0.010       7      PF4

Seurat has four tests for differential expression (DE) which can be set with the test.use parameter in the FindMarkers() function:

  1. ROC test
  2. t-test
  3. LRT test based on zero-inflated data
  4. LRT test based on tobit-censoring models

Let's compare the four different DE methods for defining cluster 1.

[1] "0" "1" "2" "3" "4" "5" "6" "7"


   0    1    2    3    4    5    6    7 
1149  478  343  308  158  155   32   15

my_bimod <- FindMarkers(object = pbmc,
                        ident.1 = 1,
                        thresh.use = 0.25,
                        test.use = "bimod",
                        only.pos = TRUE)

my_roc <- FindMarkers(object = pbmc,
                      ident.1 = 1,
                      thresh.use = 0.25,
                      test.use = "roc",
                      only.pos = TRUE)

my_t <- FindMarkers(object = pbmc,
                    ident.1 = 1,
                    thresh.use = 0.25,
                    test.use = "t",
                    only.pos = TRUE)

my_tobit <- FindMarkers(object = pbmc,
                        ident.1 = 1,
                        thresh.use = 0.25,
                        test.use = "tobit",
                        only.pos = TRUE)

# identical set of genes
[1] 570   4
[1] 570   5
[1] 570   4
[1] 570   4

# the rankings of the genes are quite similar between the methods
my_gene <- row.names(my_bimod)
a <- 1:length(my_gene)
b <- match(my_gene, row.names(my_roc))
c <- match(my_gene, row.names(my_t))
d <- match(my_gene, row.names(my_tobit))

# bimod vs. bimod
cor(a, a, method = "spearman")
[1] 1

# bimod vs. roc
cor(a, b, method = "spearman")
[1] 0.880193

# bimod vs. t
cor(a, c, method = "spearman")
[1] 0.9575922

# bimod vs. tobit
cor(a, d, method = "spearman")
[1] 0.9759479

barplot(a, main = 'bimod')
barplot(b, main = 'roc')
barplot(c, main = 't')
barplot(d, main = 'tobit')

The VlnPlot() and FeaturePlot() functions can be used to visualise marker expression.

VlnPlot(object = pbmc, features.plot = c("MS4A1", "CD79A"))

# visualise markers found by the FindMarkers() analysis above
       p_val avg_diff pct.1 pct.2
S100A9     0 3.827593 0.996 0.216
S100A8     0 3.786535 0.973 0.123
LYZ        0 3.116197 1.000 0.517
LGALS2     0 2.634722 0.908 0.060
FCN1       0 2.369524 0.956 0.150
FTL        0 1.804296 1.000 0.988

VlnPlot(object = pbmc, features.plot = c("S100A9", "S100A8"))

# you can plot raw UMI counts as well
VlnPlot(object = pbmc,
        features.plot = c("NKG7", "PF4"),
        use.raw = TRUE,
        y.log = TRUE)

FeaturePlot(object = pbmc,
            features.plot = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"),
            cols.use = c("grey", "blue"),
            reduction.use = "tsne")

FeaturePlot(object = pbmc,
            features.plot = head(row.names(my_tobit), 9),
            cols.use = c("grey", "blue"))

The DoHeatmap() function creates a heatmap of genes across all cells. Below, we use plot the top 10 marker genes for the eight clusters.

              p_val  avg_diff pct.1 pct.2 cluster  gene
LDHB  9.480623e-268 1.1490575 0.924 0.483       0  LDHB
CD3D  1.576099e-240 1.0468542 0.870 0.250       0  CD3D
RPSA  3.436777e-176 0.5154017 0.992 0.933       0  RPSA
RPS29 1.214926e-162 0.6124085 0.990 0.878       0 RPS29
IL32  2.683045e-161 0.6865265 0.836 0.333       0  IL32
LTB   5.232177e-160 0.8968131 0.932 0.528       0   LTB

pbmc.markers %>%
  group_by(cluster) %>%
  top_n(10, avg_diff) -> top10

# A tibble: 6 x 6
# Groups:   cluster [1]
          p_val  avg_diff pct.1 pct.2 cluster  gene
          <dbl>     <dbl> <dbl> <dbl>  <fctr> <chr>
1 9.480623e-268 1.1490575 0.924 0.483       0  LDHB
2 1.576099e-240 1.0468542 0.870 0.250       0  CD3D
3 5.232177e-160 0.8968131 0.932 0.528       0   LTB
4 1.254598e-154 0.9722549 0.768 0.267       0  CD3E
5 3.519020e-133 1.0681216 0.662 0.202       0  IL7R
6 6.578712e-106 0.8980251 0.649 0.259       0 NOSIP

DoHeatmap(object = pbmc,
          genes.use = top10$gene,
 = TRUE,
          slim.col.label = TRUE,
          remove.key = TRUE)

There are canonical markers for each cell type, which can be used to assess the clustering.

Cluster ID Markers Cell Type
0 IL7R CD4 T cells
1 CD14, LYZ CD14+ Monocytes
2 MS4A1 B cells
3 CD8A CD8 T cells
4 FCGR3A, MS4A7 FCGR3A+ Monocytes
5 GNLY, NKG7 NK cells
6 FCER1A, CST3 Dendritic Cells
7 PPBP Megakaryocytes
FeaturePlot(object = pbmc,
            features.plot = c("IL7R", "CD14", "LYZ", "MS4A1", "CD8A", "FCGR3A", "MS4A7", "GNLY", "NKG7", "FCER1A", "CST3", "PPBP"),
            cols.use = c("grey", "blue"),
            reduction.use = "tsne")

current.cluster.ids <- c(0, 1, 2, 3, 4, 5, 6, 7)
new.cluster.ids <- c("CD4 T cells",
                     "CD14+ Monocytes",
                     "B cells",
                     "CD8 T cells",
                     "FCGR3A+ Monocytes",
                     "NK cells",
                     "Dendritic cells",

pbmc@ident <- plyr::mapvalues(x = pbmc@ident,
                              from = current.cluster.ids,
                              to = new.cluster.ids)

TSNEPlot(object = pbmc, do.label = TRUE, pt.size = 0.5)

Seurat provides the StashIdent() function for keeping cluster IDs; this is useful for testing various parameters and comparing the clusters. For example, adjusting the parameters may lead to the CD4 T cells subdividing into two groups.

# stash cluster identities for later
pbmc <- StashIdent(object = pbmc, = "ClusterNames_0.6")

pbmc <- FindClusters(object = pbmc,
                     reduction.type = "pca",
                     dims.use = 1:10,
                     resolution = 0.8,
                     print.output = FALSE)

# plot two tSNE plots side by side, and colour points based on different criteria
plot1 <- TSNEPlot(object = pbmc,
                  do.return = TRUE,
                  no.legend = TRUE,
                  do.label = TRUE)

plot2 <- TSNEPlot(object = pbmc,
                  do.return = TRUE,
         = "ClusterNames_0.6",
                  no.legend = TRUE,
                  do.label = TRUE)

plot_grid(plot1, plot2)

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
30 comments Add yours
  1. Thanks! This is very helpful in understanding their tutorial a bit more.

    Could you elaborate on why scaling and centering the data is useful for dimension reduction? I get the purpose of regressing out unwanted heterogeneity that they’ve coupled into the scaling step, by why shouldn’t you just PCA and tSNE on the log normalized data?


  2. Thank you so much! I have read your two blogs about monocle and seurat. And they have helped me a lot in the analysis of scRNA-seq.

    And i want to know that as there are many sofawares including monocle and seurat,what should we choose for analysis? As i know, cellrangerRkit can also do clustering, DEG analysis as seurat. And SC3 is a new software for clustering which is used in some papers. So what’s the difference between these softwares and how can i choose the suitable one?

    Thanks again!

  3. Thanks! This tutorial is very helpful!
    I have a question, in the GenePlot of percent.mito vs nUMI, the title the -0.13, what does it represents?

    1. Looking at the source code for the GenePlot(), it should be the Pearson correlation between the two variables (percent.mito vs nUMI).

  4. Thank you so much for your blog on Seurat!
    I have a question on using FindMarkers, I’d like to get statistical result on all variable genes that I input in the function, and I set logfc.threshold = 0, min.pct = 0, min.cells = 0, and return.thresh = 1. But there are still a couple hundred genes missing from the output. Am I doing something wrong here?? and is there other way to get all the variable genes?

    Thank you so much for your time!

    1. Hi Yuri, thanks for the comment. Actually, this is a question that interests me as well, since I had wanted to perform a GSEA and needed the fold change and p-values of all genes but just like you I couldn’t get the full list of genes back. You can ask the developers at

  5. Hi,
    Thanks for the tutorial, it is very helpful. I have a question, how does one get from the TSNE calculation, Cell to Cluster identity data- I would like to have have a column of all the cells and their corresponding Cluster attribute for example.

    1. The cluster identity will be stored in the slot; the cell identifiers are the row names.

      1. HI, DAVO,
        I also would like to get the full list of genes, when i set logfc.threshold = 0, min.pct = 0, min.cells = 0, there are still 400 genes missed. so have you fixed it ?


  6. Hi,
    Thanks for the tutorial. It’s very useful.
    I have several questions:

    1) The plots generated using VizPCA, what does the values on the x-axis means?

    2) I am slightly confused with the jackstraw procedure. Could you please clarify if my understanding below is correct?
    1st: Randomly permutes a subset of genes from the dataset
    2nd: Perform PCA to get PCA scores – what are PCA scores?
    3rd: Compares the original PCs with the PCA scores of these random genes to determine p-value

    3) I also don’t understand the last statement in the jackstraw procedure – “End result is a p-value for each gene’s association with each principal component.” The p-value is the association of the PC with the random subset of genes isn’t it? Not each gene’s association right? Could you please explain? I’m rather confused here.

    4) What does the SD of PC means in the Elbow plot? How is this SD calculated?

    Hope to hear back from you soon.

    Thank you very much.

  7. Thank you for nice and kind tutorial. This is really helpful for me.
    I have a question about data handling and plotting.
    I’m using 10x data generated from 3 different sample (aggregated data). I’ve successfully marked each cluster with cell type based on marker gene expression according to this tutorial. But, I can’t know which cluster(cells) come from which samples. Can I mark sample identity together with cluster name (or marker gene) in a single plot?

    1. If you add the relevant information to the slot using the AddMetaData() function, you can use ggplot2 and map the meta data to various aesthetics in a scatter plot.

  8. Thank you for your blog on Seurat and Monocle!
    I have a question: I want to do cell cluster by Seurat, and use the result of treatment from Seurat to construct trajectories by Monocle. What should I do to get the data that Monocle can read in? Thank you.

    1. You can get the cell cluster information from the slot in the Seurat object and add this to the Monocle object as phenoData. Similarly, you can output the data in the slot of the Seurat object and use it as the expression matrix when creating the Monocle object.

  9. Hello Dave.
    Thanks a ton for theses blogs. Helped me a lot. I have a doubt. I wanted to use SingleR, a new package to annotate the cell type from single cell rna seq Data. I did all other analysis in Seurat. So I wanted to use expression matrix from Seurat in SingleR. So I used object@data in place of matrix and out worked fine. But I see that object@data is only normalized expression matrix. The data where nUMI, percent.mito regreesed out is found in But I think scaling introduces negative expression values or something, because SingleR annotation goes haywire if I use The two arguments in the function of Seurat- do.scale and, Can any of these be helpful to me to create the most nearest Seurat object for annotation?

    1. I haven’t used SingleR yet but yes, the scaling step does introduce negative values. If you set the center to FALSE, it shouldn’t perform the centering, and that should leave you with only positive values.

  10. Hi Dave,

    I was wondering if you had any advice regarding the selection of highly variable genes?

    In the following tutorial:

    They suggest playing with the parameters until it the plot marks outliers, but I’m not 100% on interpreting the plot and where exactly I should be looking for these outliers?

    Meanwhile in this tutorial:

    They basically run default parameters and then select the top 1k variable genes (but don’t justify why they pick only 1k, when the first tutorial resulted in ~2k HVGs)

    I’m worried as it seems like an important step as you run the PCA based on these genes.

    Thanks in advance,


  11. Hi Dave

    Thanks for your useful blog.

    In my project, I merged four different single cell RNA-seq data set into one Seurat object and did Louvain clustering on the object. In the resulted t-SNE I want to know what proportion of each cluster is made by different samples. Do you have any suggestion to do that?


    1. Hi Moein,

      the meta data slot in the Seurat object contains the cluster membership numbers. If you have the sample factors in the meta data, you can simply use the table() function in R to create a table of cluster vs. samples.



  12. I am bit confused:

    # check how many genes have at least one transcript in each cell
    at_least_one 0))
    hist(at_least_one, breaks = 100,main = “Distribution of detected genes”,
    xlab = “Genes with at least one tag”,ylim = c(0,400))

    If I understood correctly “check how many genes have at least one transcript in each cell” should mean consider genes with transcripts greater than 0 in each cell

    for example if we have data with 3 cells and 5 genes like follows
    > temp.df colnames(temp.df) rownames(temp.df) temp.df
    C1 C2 C3
    G1 0 1 6
    G2 0 1 0
    G3 1 8 3
    G4 6 0 4
    G5 3 6 7

    We will consider genes G3 and G5 only thereby reducing the matrix as following
    > row_sub = apply(temp.df, 1, function(row) all(row !=0 ))
    > (temp.sub (at_least_one 0)))
    C1 C2 C3
    3 4 4

    This is my understanding is “for each cell how many genes have more than one transcript”, i.e. for C1, G3, G4 and G5 have more than one transcript and so on.

    Now, the command: hist(at_least_one, breaks = 100,main = “Distribution of detected genes”, xlab = “Genes with at least one tag”)
    will plot the frequency of the genes for each cell, i.e. 2 cells have 4 genes and 1 cell have 3 genes which look very different from “Genes with at least one tag”

    I might very well have not understood clearly and apologies if I have asked a silly question but I hope someone can clarify my doubt

  13. Your Blogs have been very helpful thanks. In this example what are the units for the y-axis in the vlnplot of specific genes?

  14. HI Dave,

    In ScoreJackStraw, # of dims are limited to 20, and hence the JackStrawPlot. Is there any way I can set n of dims over 20 ?

  15. Have you been able to crack the color code for the DoHeatmap function? Meaning, have you a couple examples of a DIY divergent purple-yellow heatmap color palette for heatmaps not made with seurat data? Thank you!! Huge fan

    1. Hi there,

      the heatmap is a ggplot object, so you can modify the colours using ggplot2 functions. For example,

      my_heatmap <- DoHeatmap(object = pbmc_small) my_heatmap + scale_fill_gradient(low = "white", high = "red")

      Thanks for checking out the blog!


Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.