Getting started with Monocle

Monocle is an R package developed for analysing single cell gene expression data. Specifically, the package provides functionality for clustering and classifying single cells, conducting differential expression analyses, and constructing and investigating inferred developmental trajectories. The toolkit provides various alternative approaches for each analysis, hence your workflow may differ from the approach I've taken in this post.

I will use an open dataset provided by 10x Genomics, which was generated from Peripheral Blood Mononuclear Cells (PBMCs) from a healthy donor. PBMCs are primary cells, which are cells isolated directly from tissue using enzymatic or mechanical methods, with relatively small amounts of RNA (around 1pg RNA per cell). The data was sequenced on an Illumina HiSeq 4000 with approximately 92,000 reads per 8,381 cells. This dataset was processed with Cell Ranger 2.0.1 with the parameter --cells=10000.

To get started install Monocle from Bioconductor.

source("http://bioconductor.org/biocLite.R")
biocLite("monocle")

# load library
library(monocle)

# version used for this post
packageVersion('monocle')
[1] ‘2.4.0’

Now download the processed dataset (around 37M).

wget -c http://cf.10xgenomics.com/samples/cell-exp/2.0.1/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz

# on OS X it's md5
# on other Linux distros it is typically md5sum
md5 pbmc8k_filtered_gene_bc_matrices.tar.gz 
MD5 (pbmc8k_filtered_gene_bc_matrices.tar.gz) = 1ddd33d2b3de26b6d058fba089a72614

We are going to use the Cell Ranger R Kit package to read the 10X single cell data into R. Install it if you haven't already.

# see http://cf.10xgenomics.com/supp/cell-exp/cellrangerrkit-PBMC-vignette-knitr-2.0.0.pdf
source("http://cf.10xgenomics.com/supp/cell-exp/rkit-install-2.0.0.R")

# load library
library(cellrangerRkit)

# version used for this post
packageVersion("cellrangerRkit")
[1] ‘2.0.0’

The load_cellranger_matrix() function in the cellrangerRkit is used to load the downloaded single cell data into R. However, the function expects a certain directory structure, so we need to move the filtered_gene_bc_matrices directory into an outs directory.

# extract
tar -xzf pbmc8k_filtered_gene_bc_matrices.tar.gz

mkdir outs
mv filtered_gene_bc_matrices/ outs/

Now we're ready to load the pbmc8k data into R. Note that I have set the variable my_dir to the directory of where the downloaded dataset is located; modify this accordingly.

# change this according to where you downloaded and extracted the data
my_dir <- "~/muse/monocle/pbmc8k/"

# load data
gbm <- load_cellranger_matrix(my_dir)
Searching for genomes in: ~/muse/monocle/pbmc8k//outs/filtered_gene_bc_matrices 
Using GRCh38 in folder: ~/muse/monocle/pbmc8k//outs/filtered_gene_bc_matrices/GRCh38 
Loaded matrix information
Loaded gene information
Loaded barcode information
Could not find summary csv: 
	 ~/muse/monocle/pbmc8k//outs/metrics_summary.csv.
This file is only necessary if you are performing depth-normalization (calling the equalize_gbms function) in R.
If this pipestance was produced by `cellranger aggr` with the default parameters, depth-normalization in R (via equalize_gbms) is not necessary.

# what class is the object?
class(gbm)
[1] "GeneBCMatrix"
attr(,"package")
[1] "cellrangerRkit"

Conveniently, we can use the exprs(), pData(), and fData() functions to retrieve the expression, experimental phenotypes (phenoData), and feature information, respectively, from the gbm object. These functions are from the Bioconductor ExpressionSet class, which is handy because the CellDataSet class used in Monocle is also derived from this class.

# check out the expression data using exprs()
# 33,694 genes across 8,381 single cells
dim(exprs(gbm))
[1] 33694  8381

# nothing exciting going on in the first five genes in the first five cells
exprs(gbm)[1:5, 1:5]
5 x 5 sparse Matrix of class "dgTMatrix"
                AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 AAACCTGAGCTAGTGG-1 AAACCTGCACATTAGC-1 AAACCTGCACTGTTAG-1
ENSG00000243485                  .                  .                  .                  .                  .
ENSG00000237613                  .                  .                  .                  .                  .
ENSG00000186092                  .                  .                  .                  .                  .
ENSG00000238009                  .                  .                  .                  .                  .
ENSG00000239945                  .                  .                  .                  .                  .

# check out the phenotypic data
dim(pData(gbm))
[1] 8381    1

head(pData(gbm))
                              barcode
AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1
AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1
AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1
AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1
AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1
AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1

# check out the feature information
dim(fData(gbm))
[1] 33694     2

# this table will be useful for matching gene IDs to symbols
head(fData(gbm))
                             id        symbol
ENSG00000243485 ENSG00000243485  RP11-34P13.3
ENSG00000237613 ENSG00000237613       FAM138A
ENSG00000186092 ENSG00000186092         OR4F5
ENSG00000238009 ENSG00000238009  RP11-34P13.7
ENSG00000239945 ENSG00000239945  RP11-34P13.8
ENSG00000239906 ENSG00000239906 RP11-34P13.14

The main object class in Monocle is the CellDataSet object; to get started we need to create a CellDataSet object with the newCellDataSet() function. The CellDataSet object was derived from the ExpressionSet class, so it's easy to create, since the gbm object was also derived from the same class. However, Monocle expects that the gene symbol column in the feature data is called gene_short_name or else you will get a warning.

# warning!
my_cds <- newCellDataSet(exprs(gbm),
                         phenoData = new("AnnotatedDataFrame", data = pData(gbm)),
                         featureData = new("AnnotatedDataFrame", data = fData(gbm)),
                         lowerDetectionLimit = 0.5,
                         expressionFamily = negbinomial.size())

Warning message:
In newCellDataSet(exprs(gbm), phenoData = new("AnnotatedDataFrame",  :
  None of your featureData columns are named 'gene_short_name', some functions will not be able
           to take this function as input as a result

# rename gene symbol column
my_feat <- fData(gbm)
names(my_feat) <- c('id', 'gene_short_name')

# no warning
my_cds <- newCellDataSet(exprs(gbm),
                         phenoData = new("AnnotatedDataFrame", data = pData(gbm)),
                         featureData = new("AnnotatedDataFrame", data = my_feat),
                         lowerDetectionLimit = 0.5,
                         expressionFamily = negbinomial.size())

my_cds
CellDataSet (storageMode: environment)
assayData: 33694 features, 8381 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 ... TTTGTCATCTGCTTGC-1 (8381 total)
  varLabels: barcode Size_Factor
  varMetadata: labelDescription
featureData
  featureNames: ENSG00000243485 ENSG00000237613 ... ENSG00000268674 (33694 total)
  fvarLabels: id gene_short_name
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

# see ?CellDataSet for more information
slotNames(my_cds)
 [1] "reducedDimS"           "reducedDimW"           "reducedDimA"           "reducedDimK"          
 [5] "minSpanningTree"       "cellPairwiseDistances" "expressionFamily"      "lowerDetectionLimit"  
 [9] "dispFitInfo"           "dim_reduce_type"       "auxOrderingData"       "auxClusteringData"    
[13] "experimentData"        "assayData"             "phenoData"             "featureData"          
[17] "annotation"            "protocolData"          ".__classVersion__"

It is important that the appropriate expression family is used when creating a new CellDataSet. We have UMI data (that is quite large), so we'll use negbinomial.size(). If your data contains relative counts (e.g. FPKM or TPM values), you can use relative2abs() to convert these measurements into absolute counts.

Next, we'll perform some normalisation and variance estimation steps, which will be used in the differential expression analyses later on.

my_cds <- estimateSizeFactors(my_cds)
my_cds <- estimateDispersions(my_cds)
Removing 146 outliers
Warning message:
Deprecated, use tibble::rownames_to_column() instead.

Now let's explore the data! First we will need to run the detectGenes() function, which tallies the number of cells expressing a gene and the number of genes expressed among all cells. This will be clearer when you see the tables.

my_cds <- detectGenes(my_cds, min_expr = 0.1)

head(fData(my_cds))
                             id gene_short_name num_cells_expressed
ENSG00000243485 ENSG00000243485    RP11-34P13.3                   0
ENSG00000237613 ENSG00000237613         FAM138A                   0
ENSG00000186092 ENSG00000186092           OR4F5                   0
ENSG00000238009 ENSG00000238009    RP11-34P13.7                  24
ENSG00000239945 ENSG00000239945    RP11-34P13.8                   1
ENSG00000239906 ENSG00000239906   RP11-34P13.14                   0

summary(fData(my_cds)$num_cells_expressed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     5.0   349.9   234.0  8381.0

The num_cells_expressed column is a tally of the number of cells expressing a particular gene (a gene is "expressed" if there is at least one count since we set min_expr = 0.1). For example if we tally ENSG00000239945 across all cells, we should get 1, as indicated in the table above.

sum((exprs(my_cds['ENSG00000239945',])))
[1] 1

sum((exprs(my_cds['ENSG00000238009',])))
[1] 24

The number of genes expressed (num_genes_expressed) per cell is stored in phenoData. Note that if a gene has 10 UMIs or 1 UMI, it is still tallied as 1.

head(pData(my_cds))
                              barcode Size_Factor num_genes_expressed
AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1   0.5672842                 871
AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1   0.4014116                 806
AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1   1.0710628                1316
AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1   0.6606467                 898
AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1   1.1058961                1526
AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1   1.0521060                1495

# sanity check for AAACCTGAGCATCATC-1
sum((exprs(my_cds)[,"AAACCTGAGCATCATC-1"])>0)
[1] 871

summary(pData(my_cds)$num_genes_expressed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    502    1111    1297    1407    1548    4800

# standardise to Z-distribution
x <- pData(my_cds)$num_genes_expressed
x_1 <- (x - mean(x)) / sd(x)
summary(x_1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.8839 -0.6155 -0.2282  0.0000  0.2946  7.0675

library(ggplot2)
# I like the default theme of cowplot
library(cowplot)

df <- data.frame(x = x_1)
ggplot(df, aes(x)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = c(-2, 2), linetype = "dotted", color = 'red')

Only a few cells have an abnormal number of expressed genes, such as the ones that are 5 standard deviations away from the mean.

We can also add our own metadata, such as the UMI count, to the phenoData.

# add a UMI column into phenoData
# use Matrix because data is in the sparse format
pData(my_cds)$UMI <- Matrix::colSums(exprs(my_cds))

head(pData(my_cds))
                              barcode Size_Factor num_genes_expressed  UMI
AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1   0.5672842                 871 2394
AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1   0.4014116                 806 1694
AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1   1.0710628                1316 4520
AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1   0.6606467                 898 2788
AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1   1.1058961                1526 4667
AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1   1.0521060                1495 4440

ggplot(pData(my_cds), aes(num_genes_expressed, UMI)) + geom_point()

We could remove the cells with much higher gene (and UMI) counts as they might be doublets but I'll include them in this analysis.

Clustering cells without marker genes

Monocle provides several alternative approaches for clustering cells and the recommended approach is to use marker genes. However, since I don't know of any specific markers for this dataset I will use the unsupervised approach. The first step is determining a subset of genes to use for clustering; this is because not all genes are informative, such as those that are lowly expressed. The approach is to select gene based on their average expression and variability across cells. The dispersionTable() function calculates the mean and dispersion values.

disp_table <- dispersionTable(my_cds)
head(disp_table)
          gene_id mean_expression dispersion_fit dispersion_empirical
1 ENSG00000238009    0.0026983425      52.863362          0.000000000
2 ENSG00000279457    0.1548382003       1.068659          0.356959667
3 ENSG00000228463    0.0543320140       2.767983          0.005359107
4 ENSG00000237094    0.0022958750      62.104025         30.358663597
5 ENSG00000230021    0.0002830376     502.693229          0.000000000
6 ENSG00000237491    0.0245526857       5.943231          0.000000000

We will select genes, which have a mean expression >= 0.1, to use in the clustering step. The setOrderingFilter() function allows us to indicate which genes we want to use for clustering. The plot_ordering_genes() function plots mean expression against the empirical dispersion and highlights the set of genes (as black dots) that will be used for clustering.

table(disp_table$mean_expression>=0.1)

FALSE  TRUE 
15422  4012

unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1)

my_cds <- setOrderingFilter(my_cds, unsup_clustering_genes$gene_id)
plot_ordering_genes(my_cds)
Warning messages:
1: Transformation introduced infinite values in continuous y-axis 
2: Transformation introduced infinite values in continuous y-axis

It will take a lot of computational time to cluster 8,381 cells even when we have subsetted the total number of genes down to 4,012. The usual approach is to perform dimension reduction (typically Principal Component Analysis [PCA]) and use the principal components that explain the majority of the variance in the data. The plot_pc_variance_explained() function plots the percentage of variance explained by each component based on a PCA performed on the normalised expression data. From the plot we can decide the number of components to use in our downstream analyses.

plot_pc_variance_explained(my_cds, return_all = FALSE)
Warning in (function (A, nv = 5, nu = nv, maxit = 100, work = nv + 7, reorth = TRUE,  :
  did not converge--results might be invlaid!; try increasing maxit or fastpath=FALSE

The goal is to identify the elbow in this plot, which can be a bit subjective.

The warning message is from the irlba() function, which was used to perform the PCA. The "results might be invlaid [sic]" warning is a bit worrying though, so I'm not sure whether or not the plot is valid. I'll perform dimension reduction using five and ten principal components (by setting the num_dim parameter, which is used as input to the prcomp_irlba() function) and compare the clustering results.

The clusterCells() function performs the clustering on a CellDataSet; however, you need to input the number of requested clusters (I arbitrarily chose 15). The results of the clustering are stored in the phenoData table.

# if num_dim is not set, the default is set to 50
# I'll use 5 here based on plot_pc_variance_explained
my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 5,
                          reduction_method = 'tSNE', verbose = TRUE)
Remove noise by PCA ...
Reduce dimension by tSNE ...

# perform unsupervised clustering requesting 15-1 clusters
my_cds <- clusterCells(my_cds, num_clusters = 15)
Distance cutoff calculated to 4.797959 
Warning messages:
1: In if (method == "DDRTree") { :
  the condition has length > 1 and only the first element will be used
2: In if (method == "densityPeak") { :
  the condition has length > 1 and only the first element will be used

# cluster information is in the Cluster column
head(pData(my_cds))
                              barcode Size_Factor num_genes_expressed Cluster peaks  halo      delta      rho
AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1   0.5672842                 871       4 FALSE FALSE 0.44962230 145.2827
AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1   0.4014116                 806      12 FALSE  TRUE 0.33860313 131.3374
AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1   1.0710628                1316       5 FALSE  TRUE 0.87331834 128.1797
AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1   0.6606467                 898       9 FALSE  TRUE 0.36330440 123.4466
AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1   1.1058961                1526      11 FALSE  TRUE 0.15344369 104.5582
AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1   1.0521060                1495      11 FALSE  TRUE 0.07492976 119.3500

# store cluster info for comparing later
my_cluster_dim_5 <- pData(my_cds)$Cluster

plot_cell_clusters(my_cds)

Now with num_dim = 10.

my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 10,
                          reduction_method = 'tSNE', verbose = TRUE)
Remove noise by PCA ...
Reduce dimension by tSNE ...

my_cds <- clusterCells(my_cds, num_clusters = 15)
Distance cutoff calculated to 4.29865 
Warning messages:
1: In if (method == "DDRTree") { :
  the condition has length > 1 and only the first element will be used
2: In if (method == "densityPeak") { :
  the condition has length > 1 and only the first element will be used

my_cluster_dim_10 <- pData(my_cds)$Cluster

plot_cell_clusters(my_cds)

We'll use the Adjusted Rand Index (ARI) to compare the clustering results. The adjustedRand() function in the clues package calculates the Rand Index (RI) and different variations of the ARI (HA, MA, and FM).

library(clues)
adjustedRand(as.numeric(my_cluster_dim_5), as.numeric(my_cluster_dim_10))
     Rand        HA        MA        FM   Jaccard 
0.9047255 0.3825039 0.3833038 0.4349620 0.2774384

I wrote about why there is a difference between the RI and ARI. The ARI (HA) suggests that the clustering of cells is quite different when including different numbers of principal components. I also compared clustering results with num_dim = 4 and num_dim = 5 and while the ARI is higher than the comparison between num_dim = 5 and num_dim = 10, it still implies that the cell clusters are quite different.

# include 4 dimensions
my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 4,
                          reduction_method = 'tSNE', verbose = TRUE)
my_cds <- clusterCells(my_cds, num_clusters = 15)
my_cluster_dim_4 <- pData(my_cds)$Cluster

# include 10 dimensions
my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 5,
                          reduction_method = 'tSNE', verbose = TRUE)
my_cds <- clusterCells(my_cds, num_clusters = 15)
my_cluster_dim_10 <- pData(my_cds)$Cluster

# compare clustering results using 4 and 5 principal components
clues::adjustedRand(as.numeric(my_cluster_dim_4), as.numeric(my_cluster_dim_5))
     Rand        HA        MA        FM   Jaccard 
0.9244489 0.5236041 0.5242011 0.5650728 0.3936819

For the rest of the analysis, I will be using num_dim = 10 and num_clusters = 15.

# include 10 dimensions
my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 10,
                          reduction_method = 'tSNE', verbose = TRUE)
my_cds <- clusterCells(my_cds, num_clusters = 15)

Differential expression

One of the main interests of single cell transcriptomics is the identification of novel markers for cell types. We can use differentialGeneTest(), which is quite a flexible function, to find marker genes by assessing each gene's expression level across the cells. The model formulae used in the function can use any column in the phenoData table, including columns that we manually add. Here's the description of how it works from the Monocle tutorial page:

Monocle's differential expression analysis works essentially by fitting two models to the expression values for each gene, working through each gene independently. The first of the two models is called the full model. This model is essentially a way of predicting the expression value of each gene in a given cell knowing only whether that cell is a fibroblast or a myoblast. The second model, called the reduced model, does the same thing, but it doesn't know the CellType for each cell. It has to come up with a reasonable prediction of the expression value for the gene that will be used for all the cells. Because the full model has more information about each cell, it will do a better job of predicting the expression of the gene in each cell. The question Monocle must answer for each gene is how much better the full model's prediction is than the reduced model's. The greater the improvement that comes from knowing the CellType of each cell, the more significant the differential expression result.

I'll create a vector that indicates whether a single was clustered in "Cluster 1" or not to identify genes differentially expressed in cluster 1 versus the other clusters.

# create vector of no's
my_vector <- rep('no', nrow(pData(my_cds)))

# change status to yes if the cell was in cluster 1
my_vector[pData(my_cds)$Cluster == 1] <- rep('yes', sum(pData(my_cds)$Cluster == 1))

# add vector to phenoData
pData(my_cds)$test <- my_vector
head(pData(my_cds))
                              barcode Size_Factor num_genes_expressed Cluster peaks  halo     delta       rho test my_colour
AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1   0.5672842                 871       7 FALSE FALSE 0.1914281 174.08627   no     FALSE
AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1   0.4014116                 806       6 FALSE  TRUE 0.1170417 142.70796   no      TRUE
AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1   1.0710628                1316       3 FALSE FALSE 0.9993782 149.51023   no     FALSE
AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1   0.6606467                 898      12 FALSE  TRUE 0.3166583 153.24806   no     FALSE
AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1   1.1058961                1526       1 FALSE  TRUE 0.1818162  91.74234  yes      TRUE
AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1   1.0521060                1495       4 FALSE  TRUE 0.7871118 117.33641   no      TRUE

I'll perform the differential expression analysis on the subset of genes where the mean expression (across cells) was >= 0.1 (see section "Clustering cells without marker genes").

length(unsup_clustering_genes$gene_id)
[1] 4012

de_cluster_one <- differentialGeneTest(my_cds[unsup_clustering_genes$gene_id,],
                                       fullModelFormulaStr = '~test',
                                       cores = 8)

# not sure why we have results only for 479 genes out of 4,012
dim(de_cluster_one)
[1] 479   8

# order by q-value
library(dplyr)
de_cluster_one %>% arrange(qval) %>% head()
  status           family          pval          qval              id gene_short_name num_cells_expressed use_for_ordering
1     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000163131            CTSS                5547             TRUE
2     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000196126        HLA-DRB1                5303             TRUE
3     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000203747          FCGR3A                 951             TRUE
4     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000204287         HLA-DRA                6195             TRUE
5     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000227507             LTB                6720             TRUE
6     OK negbinomial.size 3.219336e-244 2.483842e-242 ENSG00000182287           AP1S2                3697             TRUE

Let's check out CTSS (ENSG00000163131), which is the most statistically significant gene, using plot_genes_fitter().

plot_genes_jitter(my_cds['ENSG00000163131',], grouping = "Cluster")

CTSS is specific for cluster 1 but also for clusters 4 and 6. We count data hence the expression is on a discrete scale.

We can use plot_cell_clusters() to highlight clusters 1, 4, and 6.

# add another column to the phenotypic data
# where TRUE means cluster membership in clusters 1, 4, or 6
pData(my_cds)$my_colour <- pData(my_cds)$Cluster == 1 | pData(my_cds)$Cluster == 4 | pData(my_cds)$Cluster == 6

plot_cell_clusters(my_cds, color_by = 'my_colour')

CTSS is overexpressed in clusters 1, 4, and 6, which are highlighted in turquoise.

Constructing Single Cell Trajectories

In Monocle, a single cell trajectory is the inferred developmental timeline of single cells. From the tutorial:

Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process. Once it has learned the overall "trajectory" of gene expression changes, Monocle can place each cell at its proper position in the trajectory.

The "unit" used for the trajectory is pseudotime; a cell at the beginning of the trajectory, i.e. starting state, will have a lower pseudotime than cells at the end of the trajectory, i.e. end state. Specifically, it is:

the distance between a cell and the start of the trajectory, measured along the shortest path. The trajectory's total length is defined in terms of the total amount of transcriptional change that a cell undergoes as it moves from the starting state to the end state.

The trajectory analysis consists of three stages:

  1. Choose genes that define progress
  2. Reduce the dimensionality of the data
  3. Order cells in pseudotime

For the trajectory analysis in this post, I will use only a subset of all cells (cluster 1, 4, and 6) and genes that are expressed in at least 10 cells.

expressed_genes <- row.names(subset(fData(my_cds), num_cells_expressed >= 10))

# my_colour is TRUE if a cell belongs to either cluster 1, 4, or 6
my_cds_subset <- my_cds[expressed_genes, pData(my_cds)$my_colour]

# 15,446 genes and 2,060 cells
my_cds_subset
CellDataSet (storageMode: environment)
assayData: 15446 features, 2060 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: AAACCTGAGCTAACTC-1 AAACCTGCACTGTTAG-1 ... TTTGTCATCGTCTGAA-1 (2060 total)
  varLabels: barcode Size_Factor ... my_colour (10 total)
  varMetadata: labelDescription
featureData
  featureNames: ENSG00000238009 ENSG00000279457 ... ENSG00000271254 (15446 total)
  fvarLabels: id gene_short_name num_cells_expressed use_for_ordering
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

We'll use the recommended approach of ordering based on genes that differ between clusters. First, we'll perform another subset of genes, keeping only genes expressed in greater than 5% of cells.

my_cds_subset <- detectGenes(my_cds_subset, min_expr = 0.1)
fData(my_cds_subset)$use_for_ordering <- fData(my_cds_subset)$num_cells_expressed > 0.05 * ncol(my_cds_subset)

# how many genes are used?
table(fData(my_cds_subset)$use_for_ordering)

FALSE  TRUE 
 9141  6305

plot_pc_variance_explained(my_cds_subset, return_all = FALSE)
Warning in (function (A, nv = 5, nu = nv, maxit = 100, work = nv + 7, reorth = TRUE,  :
  did not converge--results might be invlaid!; try increasing maxit or fastpath=FALSE

There's the same warning message as before when running plot_pc_variance_explained(), so I'll just use num_dim = 10 and perform clustering as before (see section "Clustering cells without marker genes") but this time without specifying the number of clusters; we will use thresholds on the cell's local density (rho) and nearest distance (delta) to determine the number of clusters.

my_cds_subset <- reduceDimension(my_cds_subset,
                                 max_components = 2,
                                 norm_method = 'log',
                                 num_dim = 10,
                                 reduction_method = 'tSNE',
                                 verbose = TRUE)

my_cds_subset <- clusterCells(my_cds_subset, verbose = FALSE)
Distance cutoff calculated to 3.551792 
Warning messages:
1: In if (method == "DDRTree") { :
  the condition has length > 1 and only the first element will be used
2: In if (method == "densityPeak") { :
  the condition has length > 1 and only the first element will be used

plot_rho_delta(my_cds_subset, rho_threshold = 2, delta_threshold = 10)

Scatter plot of rho versus delta.

We'll use rho = 2 and delta = 10 to cluster the cells again.

my_cds_subset <- clusterCells(my_cds_subset,
                              rho_threshold = 2,
                              delta_threshold = 10,
                              skip_rho_sigma = T,
                              verbose = FALSE)
Warning messages:
1: In if (method == "DDRTree") { :
  the condition has length > 1 and only the first element will be used
2: In if (method == "densityPeak") { :
  the condition has length > 1 and only the first element will be used

table(pData(my_cds_subset)$Cluster)

   1    2    3    4 
 229  405  175 1251

plot_cell_clusters(my_cds_subset)

Now we'll perform the differential gene expression analysis as before but across all cell clusters.

clustering_DEG_genes <- differentialGeneTest(my_cds_subset,
                                             fullModelFormulaStr = '~Cluster',
                                             cores = 8)
Warning messages:
1: In inherits(x, "ordered") :
  closing unused connection 10 (<-localhost:11278)
2: In inherits(x, "ordered") :
  closing unused connection 9 (<-localhost:11278)
3: In inherits(x, "ordered") :
  closing unused connection 8 (<-localhost:11278)
4: In inherits(x, "ordered") :
  closing unused connection 7 (<-localhost:11278)
5: In inherits(x, "ordered") :
  closing unused connection 6 (<-localhost:11278)
6: In inherits(x, "ordered") :
  closing unused connection 5 (<-localhost:11278)
7: In inherits(x, "ordered") :
  closing unused connection 4 (<-localhost:11278)
8: In inherits(x, "ordered") :
  closing unused connection 3 (<-localhost:11278)

dim(clustering_DEG_genes)
[1] 15446     8

clustering_DEG_genes %>% arrange(qval) %>% head()
  status           family          pval          qval              id gene_short_name num_cells_expressed use_for_ordering
1     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000143546          S100A8                2034             TRUE
2     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000163220          S100A9                2049             TRUE
3     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000163221         S100A12                1740             TRUE
4     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000203747          FCGR3A                 358             TRUE
5     OK negbinomial.size 1.106361e-319 3.417771e-316 ENSG00000231389        HLA-DPA1                1671             TRUE
6     OK negbinomial.size 2.237553e-312 5.760207e-309 ENSG00000090382             LYZ                2057             TRUE

We'll use the top 1,000 most significantly differentially expressed genes as the set of ordering genes and perform the dimension reduction and the trajectory analysis (using the orderCells() function).

my_ordering_genes <- row.names(clustering_DEG_genes)[order(clustering_DEG_genes$qval)][1:1000]
my_cds_subset <- setOrderingFilter(my_cds_subset, ordering_genes = my_ordering_genes)
my_cds_subset <- reduceDimension(my_cds_subset, method = 'DDRTree')

# the warnings were for use of deprecated code
my_cds_subset <- orderCells(my_cds_subset)
There were 50 or more warnings (use warnings() to see the first 50)

plot_cell_trajectory(my_cds_subset, color_by = "Cluster")

Finding Genes that Change as a Function of Pseudotime

Once we have a trajectory, we can use differentialGeneTest() to find genes that have an expression pattern that varies according to pseudotime.

# pseudotime is now a column in the phenotypic data as well as the cell state
head(pData(my_cds_subset))
                              barcode Size_Factor num_genes_expressed Cluster peaks  halo     delta      rho test my_colour
AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1   0.4014116                 806       4 FALSE  TRUE 0.5737701 35.64480   no      TRUE
AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1   1.1058961                1523       2 FALSE  TRUE 0.7596041 38.15302  yes      TRUE
AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1   1.0521060                1495       4 FALSE  TRUE 0.5180008 33.71503   no      TRUE
AAACCTGGTAGAAGGA-1 AAACCTGGTAGAAGGA-1   1.3016257                1628       2 FALSE  TRUE 0.4064823 34.54570  yes      TRUE
AAACCTGGTTTGCATG-1 AAACCTGGTTTGCATG-1   1.5070707                1899       2 FALSE  TRUE 0.3787114 35.82955  yes      TRUE
AAACCTGTCGTGGACC-1 AAACCTGTCGTGGACC-1   1.0878871                1321       3 FALSE FALSE 0.6943011 37.86991  yes      TRUE
                   Pseudotime State
AAACCTGAGCTAACTC-1  17.595545     5
AAACCTGCACTGTTAG-1   6.930170     1
AAACCTGCATAGTAAG-1  15.136582     5
AAACCTGGTAGAAGGA-1   4.916377     1
AAACCTGGTTTGCATG-1   5.139131     1
AAACCTGTCGTGGACC-1  12.394433     4

my_pseudotime_de <- differentialGeneTest(my_cds_subset,
                                         fullModelFormulaStr = "~sm.ns(Pseudotime)",
                                         cores = 8)

my_pseudotime_de %>% arrange(qval) %>% head()
  status           family pval qval              id gene_short_name num_cells_expressed use_for_ordering
1     OK negbinomial.size    0    0 ENSG00000143546          S100A8                2034             TRUE
2     OK negbinomial.size    0    0 ENSG00000163220          S100A9                2049             TRUE
3     OK negbinomial.size    0    0 ENSG00000163221         S100A12                1740             TRUE
4     OK negbinomial.size    0    0 ENSG00000203747          FCGR3A                 358             TRUE
5     OK negbinomial.size    0    0 ENSG00000231389        HLA-DPA1                1671             TRUE
6     OK negbinomial.size    0    0 ENSG00000257764   RP11-1143G9.4                1896             TRUE

# save the top 6 genes
my_pseudotime_de %>% arrange(qval) %>% head() %>% select(id) -> my_pseudotime_gene
my_pseudotime_gene <- my_pseudotime_gene$id

plot_genes_in_pseudotime(my_cds_subset[my_pseudotime_gene,])

Clustering Genes by Pseudotemporal Expression Pattern

Monocle provides functionality for clustering genes according to their pseudotime value. The clustering analysis and plotting are done in one step using the plot_pseudotime_heatmap() function; to save the clustering results, use return_heatmap = TRUE.

# cluster the top 50 genes that vary as a function of pseudotime
my_pseudotime_de %>% arrange(qval) %>% head(50) %>% select(id) -> gene_to_cluster
gene_to_cluster <- gene_to_cluster$id

my_pseudotime_cluster <- plot_pseudotime_heatmap(my_cds_subset[gene_to_cluster,],
                                                 num_clusters = 3,
                                                 cores = 8,
                                                 show_rownames = TRUE,
                                                 return_heatmap = TRUE)

The columns of the heatmap are pseudotime values binned into 100 bins. Here's the source code:

newdata <- data.frame(Pseudotime = seq(min(pData(my_cds_subset)$Pseudotime), 
                                       max(pData(my_cds_subset)$Pseudotime), length.out = 100))

Since we saved the results in my_pseudotime_cluster, we can extract the genes for each cluster.

# hierarchical clustering was used to cluster the genes
# we can cut the dendrogram to form the same 3 clusters as plot_pseudotime_heatmap
my_cluster <- cutree(my_pseudotime_cluster$tree_row, 3)
my_cluster
ENSG00000143546 ENSG00000163220 ENSG00000163221 ENSG00000203747 ENSG00000231389 ENSG00000257764 ENSG00000163737 ENSG00000155366 
              1               1               1               2               3               1               2               2 
ENSG00000163736 ENSG00000090382 ENSG00000038427 ENSG00000223865 ENSG00000019582 ENSG00000169429 ENSG00000196735 ENSG00000142089 
              2               1               1               3               3               1               3               2 
ENSG00000187109 ENSG00000185201 ENSG00000168497 ENSG00000166927 ENSG00000129757 ENSG00000188290 ENSG00000197353 ENSG00000163563 
              2               2               2               2               2               2               2               1 
ENSG00000123416 ENSG00000127920 ENSG00000180573 ENSG00000214548 ENSG00000101162 ENSG00000196126 ENSG00000170345 ENSG00000103187 
              2               2               2               2               2               3               1               2 
ENSG00000204525 ENSG00000130066 ENSG00000130522 ENSG00000158517 ENSG00000125898 ENSG00000108798 ENSG00000065978 ENSG00000173372 
              3               2               1               1               2               2               2               2 
ENSG00000170873 ENSG00000170458 ENSG00000212907 ENSG00000234745 ENSG00000204287 ENSG00000171611 ENSG00000132965 ENSG00000162444 
              2               1               1               3               3               2               1               1 
ENSG00000105372 ENSG00000148737 
              2               2 

# genes in cluster 1
my_pseudotime_de[names(my_cluster[my_cluster == 1]),"gene_short_name"]
 [1] "S100A8"        "S100A9"        "S100A12"       "RP11-1143G9.4" "LYZ"           "VCAN"          "CXCL8"        
 [8] "MNDA"          "FOS"           "JUND"          "NCF1"          "CD14"          "MT-ND4L"       "ALOX5AP"      
[15] "RBP7"

# genes in cluster 2
my_pseudotime_de[names(my_cluster[my_cluster == 2]),"gene_short_name"]
 [1] "FCGR3A"    "PF4"       "RHOC"      "PPBP"      "IFITM3"    "NAP1L1"    "IFITM2"    "SDPR"      "MS4A7"     "CDKN1C"   
[11] "HES4"      "LYPD2"     "TUBA1B"    "GNG11"     "HIST1H2AC" "MEG3"      "TUBB1"     "COTL1"     "SAT1"      "FAM110A"  
[21] "ABI3"      "YBX1"      "C1QA"      "MTSS1"     "PTCRA"     "RPS19"     "TCF7L2"

# genes in cluster 3
my_pseudotime_de[names(my_cluster[my_cluster == 3]),"gene_short_name"]
[1] "HLA-DPA1" "HLA-DPB1" "CD74"     "HLA-DQA1" "HLA-DRB1" "HLA-C"    "HLA-B"    "HLA-DRA"

Analyzing Branches in Single-Cell Trajectories

In our trajectory, we have two branches, which represents cells that have alternative gene expression patterns. These represent cells that have supposedly gone through different developmental paths. Monocle provides functions that allows you to identify the genes that differ at a particular branch point. Here is the trajectory again.

plot_cell_trajectory(my_cds_subset, color_by = "Cluster")

The BEAM() function takes a CellDataSet that has been ordered with orderCells() and a branch point in the trajectory. A table of genes is returned with significance values that indicate whether genes have expression patterns that are branch dependent.

# warnings not shown
BEAM_res <- BEAM(my_cds_subset, branch_point = 1, cores = 8)
BEAM_res <- BEAM_res[order(BEAM_res$qval),]
BEAM_res <- BEAM_res[,c("gene_short_name", "pval", "qval")]

# check out the results
head(BEAM_res)
                gene_short_name          pval          qval
ENSG00000143546          S100A8 5.715373e-202 8.388453e-198
ENSG00000163220          S100A9 2.170305e-139 1.592678e-135
ENSG00000163221         S100A12 3.222573e-113 1.576590e-109
ENSG00000223865        HLA-DPB1  4.574242e-93  1.678404e-89
ENSG00000196735        HLA-DQA1  6.264157e-78  1.838781e-74
ENSG00000196126        HLA-DRB1  2.437269e-77  5.961965e-74

table(BEAM_res$qval < 1e-4)

FALSE  TRUE 
15230   216

my_branched_heatmap <- plot_genes_branched_heatmap(my_cds_subset[row.names(subset(BEAM_res, qval < 1e-4)),],
                                                   branch_point = 1,
                                                   num_clusters = 4,
                                                   cores = 8,
                                                   use_gene_short_name = TRUE,
                                                   show_rownames = TRUE,
                                                   return_heatmap = TRUE)

The heatmap shows how some genes are over-expressed or under-expressed depending on the trajectory path.

We can return genes that belong to specific clusters that were identified by BEAM().

head(my_branched_heatmap$annotation_row)
         Cluster
S100A8         1
S100A9         1
S100A12        1
HLA-DPB1       2
HLA-DQA1       2
HLA-DRB1       2

dim(my_branched_heatmap$annotation_row)
[1] 216   1

table(my_branched_heatmap$annotation_row$Cluster)

 1  2  3  4 
67 32 87 30

my_row <- my_branched_heatmap$annotation_row
my_row <- data.frame(cluster = my_row$Cluster,
                     gene = row.names(my_row),
                     stringsAsFactors = FALSE)

head(my_row[my_row$cluster == 3,'gene'])
[1] "RPL3"   "RPS12"  "RPL5"   "EEF1A1" "RPSA"   "RPS6"

my_gene <- row.names(subset(fData(my_cds_subset),
                            gene_short_name %in% head(my_row[my_row$cluster == 3,'gene'])))

# plot genes that are expressed in a branch dependent manner
plot_genes_branched_pseudotime(my_cds_subset[my_gene,],
                               branch_point = 1,
                               ncol = 1)

The trend lines show the expression pattern of genes along the paths formed by branch point 1.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
15 comments Add yours
  1. Thanks for your article,I have a question about how to confirm the number of cluster and dim, it is so subjective.Thank you

  2. Thank you so much Mr Tang! I have learned a lot from your blog to use monocle, seurat, intersect and so on. Your blog is soooo helpful ! Thanks again!

  3. Mr. Tang, Thank you very much!
    I am trying to follow your blog but I noticed little bit problem.
    Could you suggest me in this regard ?
    —– plot_pc_variance_explained(my_cds, return_all = F) # norm_method=’log’
    Problem is the above command is not executing .

    > plot_pc_variance_explained(my_cds, return_all = F)
    Error in (function (A, nv = 5, nu = nv, maxit = 100, work = nv + 7, reorth = TRUE, :
    BLAS/LAPACK routine ‘DLASCL’ gave error code -4

    I don’t know why.??

    1. the package of irlba or sth. related not installed complitely
      for irlba, a matrix computing package, you need to install the lastest github version ,then monocle can use it

  4. Hi Dave,

    Do you know I would go about doing differential expression analysis between two samples and not between clusters?

    Cheers

  5. Hi, Dave

    I really appreciated your awesome tutorial. I have one question about plot_cell_clusters. It seems that the arguments in this function such as markers, cell_size have been deprecated. I followed the instruction in Monocle2 website, but I can not make a plot like they did. For example, the authors can map the size of the dots to the log expression value and map the color of the dots according to CellType, in the mean time showing two markers in two separate plots. My R version is 3.8 and my monocle version is 2.8.0. In your tutorial, you only add the CellDataSet object into the function without any other arguments. I am not sure whether it is because you have the same problem or not.

    Thanks, hope to hear from you soon.

    Jiarui

    1. Great post!

      What about if u have cases and controls and u want to follow this protocol ?

  6. Hi Dave,

    Thank you for your tutorial on Monocle. It’s very useful.
    I don’t quite understand the meaning or use of rho and delta when you do the clustering.
    Could you explain it a bit more for me?

    Thank you very much.

  7. Thank you so much for this tutorial! Very helpful! Above you said “We could remove the cells with much higher gene (and UMI) counts as they might be doublets but I’ll include them in this analysis.”

    I’d like to remove cells with very low UMI counts and very high UMI counts, to remove dead cells and doubles. With Seurat, I exclude less than 2000 and above 6000. How do you do this in monocle?

    Thanks!

  8. Under Constructing Single Cell Trajectories section:
    This line of code does not run:
    my_cds_subset <- reduceDimension(my_cds_subset,
    max_components = 2,
    norm_method = 'log',
    num_dim = 10,
    reduction_method = 'tSNE',
    verbose = TRUE)
    if my_cds_subset is generated by: my_cds_subset <- my_cds[expressed_genes, pData(my_cds)$my_colour]; then the error is: Error in `$<-.data.frame`(`*tmp*`, "tsne_1", value = c(-4.94368917177606, : replacement has 8381 rows, data has 2425. I then manually re-level all the factor in pData(my_cds_subset) it still shows this error.

    Next I tried to save the cluster 1 and 3 cell barcodes, and re-build the my_cds_subset monocle object from the raw matrix (subsetting with the barcode I saved to generate the exprs matrix and pData). The new object is 15446 features, 2425 samples ; next I rerun the whole monocle pipeline, and when hitting the reduceDimensions() step, the error is now always: Error in .check_tsne_params(nrow(X), dims = dims, perplexity = perplexity, : perplexity is too large for the number of samples
    I changed the perplexity to 1, still show the error..

    Any suggestion? or Any one has reached beyond the reduceDimension() on my_cds_subset?
    Thanks a lot.

  9. Dear Dave and all,
    the code is old.
    And `load_cellranger_matrix()`does strange stuff (like adding some subfolder names and looking there). This was quite annoying. So I went to github site of cellrangerRkit, looked into code, and created a hacked version of `load_cellranger_matrix()`:

    “`
    load_cellranger_matrix <- function(data_dir,
    summary_fn=NULL) {
    mat_fn <- file.path(data_dir, "matrix.mtx")
    gene_fn <- file.path(data_dir, "genes.tsv")
    barcode_fn <- file.path(data_dir, "barcodes.tsv")
    mat <- readMM(mat_fn)
    gene_info <- read.delim(gene_fn, stringsAsFactors=FALSE, sep="\t", header=FALSE)
    barcodes <- read.delim(barcode_fn, stringsAsFactors=FALSE, sep="\t", header=FALSE)

    if (dim(mat)[1] != dim(gene_info)[1]) {
    stop(sprintf("Mismatch dimension between gene file: \n\t %s\n and matrix file: \n\t %s\n",
    gene_fn,
    mat_fn))
    }
    if (dim(mat)[2] != dim(barcodes)[1]) {
    stop(sprintf("Mismatch dimension between gene file: \n\t %s\n and matrix file: \n\t %s\n",
    barcode_fn,
    mat_fn))
    }
    rownames(mat) <- gene_info[, 1]
    colnames(mat) <- barcodes[, 1]
    gene_symbols <- gene_info
    row.names(gene_symbols) <- gene_symbols[, 1]
    colnames(gene_symbols) <- c("id", "symbols")
    pd <- data.frame(id=barcodes[,1], row.names=barcodes[,1])
    colnames(pd) <- c("barcode")

    res <- newGeneBCMatrix(mat=mat, fd=gene_symbols, pd=pd)
    res@subsampled <- FALSE
    if (is.null(summary_fn)) {
    warning("No summary file provided. Some functions may be disabled without the metrics in summary csv.\n")
    } else {
    summary <- read.csv(summary_fn, as.is=TRUE, header=TRUE)
    res@summary <- summary
    }
    res
    }
    “`

    With this function,
    “`
    gbm <- load_cellranger_matrix(data_dir)
    class(gbm)
    “`
    works now flawlessly!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.