# Getting started with Monocle

Monocle is an R package developed for analysing single cell gene expression data. Specifically, the package provides functionality for clustering and classifying single cells, conducting differential expression analyses, and constructing and investigating inferred developmental trajectories. The toolkit provides various alternative approaches for each analysis, hence your workflow may differ from the approach I've taken in this post.

I will use an open dataset provided by 10x Genomics, which was generated from Peripheral Blood Mononuclear Cells (PBMCs) from a healthy donor. PBMCs are primary cells, which are cells isolated directly from tissue using enzymatic or mechanical methods, with relatively small amounts of RNA (around 1pg RNA per cell). The data was sequenced on an Illumina HiSeq 4000 with approximately 92,000 reads per 8,381 cells. This dataset was processed with Cell Ranger 2.0.1 with the parameter --cells=10000.

To get started install Monocle from Bioconductor.

source("http://bioconductor.org/biocLite.R")
biocLite("monocle")

library(monocle)

# version used for this post
packageVersion('monocle')
[1] ‘2.4.0’


wget -c http://cf.10xgenomics.com/samples/cell-exp/2.0.1/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz

# on OS X it's md5
# on other Linux distros it is typically md5sum
md5 pbmc8k_filtered_gene_bc_matrices.tar.gz
MD5 (pbmc8k_filtered_gene_bc_matrices.tar.gz) = 1ddd33d2b3de26b6d058fba089a72614


We are going to use the Cell Ranger R Kit package to read the 10X single cell data into R. Install it if you haven't already.

# see http://cf.10xgenomics.com/supp/cell-exp/cellrangerrkit-PBMC-vignette-knitr-2.0.0.pdf
source("http://cf.10xgenomics.com/supp/cell-exp/rkit-install-2.0.0.R")

library(cellrangerRkit)

# version used for this post
packageVersion("cellrangerRkit")
[1] ‘2.0.0’


The load_cellranger_matrix() function in the cellrangerRkit is used to load the downloaded single cell data into R. However, the function expects a certain directory structure, so we need to move the filtered_gene_bc_matrices directory into an outs directory.

# extract
tar -xzf pbmc8k_filtered_gene_bc_matrices.tar.gz

mkdir outs
mv filtered_gene_bc_matrices/ outs/


Now we're ready to load the pbmc8k data into R. Note that I have set the variable my_dir to the directory of where the downloaded dataset is located; modify this accordingly.

# change this according to where you downloaded and extracted the data
my_dir <- "~/muse/monocle/pbmc8k/"

Searching for genomes in: ~/muse/monocle/pbmc8k//outs/filtered_gene_bc_matrices
Using GRCh38 in folder: ~/muse/monocle/pbmc8k//outs/filtered_gene_bc_matrices/GRCh38
Could not find summary csv:
~/muse/monocle/pbmc8k//outs/metrics_summary.csv.
This file is only necessary if you are performing depth-normalization (calling the equalize_gbms function) in R.
If this pipestance was produced by cellranger aggr with the default parameters, depth-normalization in R (via equalize_gbms) is not necessary.

# what class is the object?
class(gbm)
[1] "GeneBCMatrix"
attr(,"package")
[1] "cellrangerRkit"


Conveniently, we can use the exprs(), pData(), and fData() functions to retrieve the expression, experimental phenotypes (phenoData), and feature information, respectively, from the gbm object. These functions are from the Bioconductor ExpressionSet class, which is handy because the CellDataSet class used in Monocle is also derived from this class.

# check out the expression data using exprs()
# 33,694 genes across 8,381 single cells
dim(exprs(gbm))
[1] 33694  8381

# nothing exciting going on in the first five genes in the first five cells
exprs(gbm)[1:5, 1:5]
5 x 5 sparse Matrix of class "dgTMatrix"
AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 AAACCTGAGCTAGTGG-1 AAACCTGCACATTAGC-1 AAACCTGCACTGTTAG-1
ENSG00000243485                  .                  .                  .                  .                  .
ENSG00000237613                  .                  .                  .                  .                  .
ENSG00000186092                  .                  .                  .                  .                  .
ENSG00000238009                  .                  .                  .                  .                  .
ENSG00000239945                  .                  .                  .                  .                  .

# check out the phenotypic data
dim(pData(gbm))
[1] 8381    1

barcode
AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1
AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1
AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1
AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1
AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1
AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1

# check out the feature information
dim(fData(gbm))
[1] 33694     2

# this table will be useful for matching gene IDs to symbols
id        symbol
ENSG00000243485 ENSG00000243485  RP11-34P13.3
ENSG00000237613 ENSG00000237613       FAM138A
ENSG00000186092 ENSG00000186092         OR4F5
ENSG00000238009 ENSG00000238009  RP11-34P13.7
ENSG00000239945 ENSG00000239945  RP11-34P13.8
ENSG00000239906 ENSG00000239906 RP11-34P13.14


The main object class in Monocle is the CellDataSet object; to get started we need to create a CellDataSet object with the newCellDataSet() function. The CellDataSet object was derived from the ExpressionSet class, so it's easy to create, since the gbm object was also derived from the same class. However, Monocle expects that the gene symbol column in the feature data is called gene_short_name or else you will get a warning.

# warning!
my_cds <- newCellDataSet(exprs(gbm),
phenoData = new("AnnotatedDataFrame", data = pData(gbm)),
featureData = new("AnnotatedDataFrame", data = fData(gbm)),
lowerDetectionLimit = 0.5,
expressionFamily = negbinomial.size())

Warning message:
In newCellDataSet(exprs(gbm), phenoData = new("AnnotatedDataFrame",  :
None of your featureData columns are named 'gene_short_name', some functions will not be able
to take this function as input as a result

# rename gene symbol column
my_feat <- fData(gbm)
names(my_feat) <- c('id', 'gene_short_name')

# no warning
my_cds <- newCellDataSet(exprs(gbm),
phenoData = new("AnnotatedDataFrame", data = pData(gbm)),
featureData = new("AnnotatedDataFrame", data = my_feat),
lowerDetectionLimit = 0.5,
expressionFamily = negbinomial.size())

my_cds
CellDataSet (storageMode: environment)
assayData: 33694 features, 8381 samples
element names: exprs
protocolData: none
phenoData
sampleNames: AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 ... TTTGTCATCTGCTTGC-1 (8381 total)
varLabels: barcode Size_Factor
featureData
featureNames: ENSG00000243485 ENSG00000237613 ... ENSG00000268674 (33694 total)
fvarLabels: id gene_short_name
experimentData: use 'experimentData(object)'
Annotation:

slotNames(my_cds)
[1] "reducedDimS"           "reducedDimW"           "reducedDimA"           "reducedDimK"
[5] "minSpanningTree"       "cellPairwiseDistances" "expressionFamily"      "lowerDetectionLimit"
[9] "dispFitInfo"           "dim_reduce_type"       "auxOrderingData"       "auxClusteringData"
[13] "experimentData"        "assayData"             "phenoData"             "featureData"
[17] "annotation"            "protocolData"          ".__classVersion__"


It is important that the appropriate expression family is used when creating a new CellDataSet. We have UMI data (that is quite large), so we'll use negbinomial.size(). If your data contains relative counts (e.g. FPKM or TPM values), you can use relative2abs() to convert these measurements into absolute counts.

Next, we'll perform some normalisation and variance estimation steps, which will be used in the differential expression analyses later on.

my_cds <- estimateSizeFactors(my_cds)
my_cds <- estimateDispersions(my_cds)
Removing 146 outliers
Warning message:


Now let's explore the data! First we will need to run the detectGenes() function, which tallies the number of cells expressing a gene and the number of genes expressed among all cells. This will be clearer when you see the tables.

my_cds <- detectGenes(my_cds, min_expr = 0.1)

id gene_short_name num_cells_expressed
ENSG00000243485 ENSG00000243485    RP11-34P13.3                   0
ENSG00000237613 ENSG00000237613         FAM138A                   0
ENSG00000186092 ENSG00000186092           OR4F5                   0
ENSG00000238009 ENSG00000238009    RP11-34P13.7                  24
ENSG00000239945 ENSG00000239945    RP11-34P13.8                   1
ENSG00000239906 ENSG00000239906   RP11-34P13.14                   0

summary(fData(my_cds)$num_cells_expressed) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0 0.0 5.0 349.9 234.0 8381.0  The num_cells_expressed column is a tally of the number of cells expressing a particular gene (a gene is "expressed" if there is at least one count since we set min_expr = 0.1). For example if we tally ENSG00000239945 across all cells, we should get 1, as indicated in the table above. sum((exprs(my_cds['ENSG00000239945',]))) [1] 1 sum((exprs(my_cds['ENSG00000238009',]))) [1] 24  The number of genes expressed (num_genes_expressed) per cell is stored in phenoData. Note that if a gene has 10 UMIs or 1 UMI, it is still tallied as 1. head(pData(my_cds)) barcode Size_Factor num_genes_expressed AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1 0.5672842 871 AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 0.4014116 806 AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1 1.0710628 1316 AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1 0.6606467 898 AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 1.1058961 1526 AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 1.0521060 1495 # sanity check for AAACCTGAGCATCATC-1 sum((exprs(my_cds)[,"AAACCTGAGCATCATC-1"])>0) [1] 871 summary(pData(my_cds)$num_genes_expressed)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
502    1111    1297    1407    1548    4800

# standardise to Z-distribution
x <- pData(my_cds)$num_genes_expressed x_1 <- (x - mean(x)) / sd(x) summary(x_1) Min. 1st Qu. Median Mean 3rd Qu. Max. -1.8839 -0.6155 -0.2282 0.0000 0.2946 7.0675 library(ggplot2) # I like the default theme of cowplot library(cowplot) df <- data.frame(x = x_1) ggplot(df, aes(x)) + geom_histogram(bins = 50) + geom_vline(xintercept = c(-2, 2), linetype = "dotted", color = 'red')  Only a few cells have an abnormal number of expressed genes, such as the ones that are 5 standard deviations away from the mean. We can also add our own metadata, such as the UMI count, to the phenoData. # add a UMI column into phenoData # use Matrix because data is in the sparse format pData(my_cds)$UMI <- Matrix::colSums(exprs(my_cds))

barcode Size_Factor num_genes_expressed  UMI
AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1   0.5672842                 871 2394
AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1   0.4014116                 806 1694
AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1   1.0710628                1316 4520
AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1   0.6606467                 898 2788
AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1   1.1058961                1526 4667
AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1   1.0521060                1495 4440

ggplot(pData(my_cds), aes(num_genes_expressed, UMI)) + geom_point()


We could remove the cells with much higher gene (and UMI) counts as they might be doublets but I'll include them in this analysis.

## Clustering cells without marker genes

Monocle provides several alternative approaches for clustering cells and the recommended approach is to use marker genes. However, since I don't know of any specific markers for this dataset I will use the unsupervised approach. The first step is determining a subset of genes to use for clustering; this is because not all genes are informative, such as those that are lowly expressed. The approach is to select gene based on their average expression and variability across cells. The dispersionTable() function calculates the mean and dispersion values.

disp_table <- dispersionTable(my_cds)
gene_id mean_expression dispersion_fit dispersion_empirical
1 ENSG00000238009    0.0026983425      52.863362          0.000000000
2 ENSG00000279457    0.1548382003       1.068659          0.356959667
3 ENSG00000228463    0.0543320140       2.767983          0.005359107
4 ENSG00000237094    0.0022958750      62.104025         30.358663597
5 ENSG00000230021    0.0002830376     502.693229          0.000000000
6 ENSG00000237491    0.0245526857       5.943231          0.000000000


We will select genes, which have a mean expression >= 0.1, to use in the clustering step. The setOrderingFilter() function allows us to indicate which genes we want to use for clustering. The plot_ordering_genes() function plots mean expression against the empirical dispersion and highlights the set of genes (as black dots) that will be used for clustering.

table(disp_table$mean_expression>=0.1) FALSE TRUE 15422 4012 unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1) my_cds <- setOrderingFilter(my_cds, unsup_clustering_genes$gene_id)
plot_ordering_genes(my_cds)
Warning messages:
1: Transformation introduced infinite values in continuous y-axis
2: Transformation introduced infinite values in continuous y-axis


It will take a lot of computational time to cluster 8,381 cells even when we have subsetted the total number of genes down to 4,012. The usual approach is to perform dimension reduction (typically Principal Component Analysis [PCA]) and use the principal components that explain the majority of the variance in the data. The plot_pc_variance_explained() function plots the percentage of variance explained by each component based on a PCA performed on the normalised expression data. From the plot we can decide the number of components to use in our downstream analyses.

plot_pc_variance_explained(my_cds, return_all = FALSE)
Warning in (function (A, nv = 5, nu = nv, maxit = 100, work = nv + 7, reorth = TRUE,  :
did not converge--results might be invlaid!; try increasing maxit or fastpath=FALSE


The goal is to identify the elbow in this plot, which can be a bit subjective.

The warning message is from the irlba() function, which was used to perform the PCA. The "results might be invlaid [sic]" warning is a bit worrying though, so I'm not sure whether or not the plot is valid. I'll perform dimension reduction using five and ten principal components (by setting the num_dim parameter, which is used as input to the prcomp_irlba() function) and compare the clustering results.

The clusterCells() function performs the clustering on a CellDataSet; however, you need to input the number of requested clusters (I arbitrarily chose 15). The results of the clustering are stored in the phenoData table.

# if num_dim is not set, the default is set to 50
# I'll use 5 here based on plot_pc_variance_explained
my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 5,
reduction_method = 'tSNE', verbose = TRUE)
Remove noise by PCA ...
Reduce dimension by tSNE ...

# perform unsupervised clustering requesting 15-1 clusters
my_cds <- clusterCells(my_cds, num_clusters = 15)
Distance cutoff calculated to 4.797959
Warning messages:
1: In if (method == "DDRTree") { :
the condition has length > 1 and only the first element will be used
2: In if (method == "densityPeak") { :
the condition has length > 1 and only the first element will be used

# cluster information is in the Cluster column
barcode Size_Factor num_genes_expressed Cluster peaks  halo      delta      rho
AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1   0.5672842                 871       4 FALSE FALSE 0.44962230 145.2827
AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1   0.4014116                 806      12 FALSE  TRUE 0.33860313 131.3374
AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1   1.0710628                1316       5 FALSE  TRUE 0.87331834 128.1797
AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1   0.6606467                 898       9 FALSE  TRUE 0.36330440 123.4466
AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1   1.1058961                1526      11 FALSE  TRUE 0.15344369 104.5582
AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1   1.0521060                1495      11 FALSE  TRUE 0.07492976 119.3500

# store cluster info for comparing later
my_cluster_dim_5 <- pData(my_cds)$Cluster plot_cell_clusters(my_cds)  Now with num_dim = 10. my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 10, reduction_method = 'tSNE', verbose = TRUE) Remove noise by PCA ... Reduce dimension by tSNE ... my_cds <- clusterCells(my_cds, num_clusters = 15) Distance cutoff calculated to 4.29865 Warning messages: 1: In if (method == "DDRTree") { : the condition has length > 1 and only the first element will be used 2: In if (method == "densityPeak") { : the condition has length > 1 and only the first element will be used my_cluster_dim_10 <- pData(my_cds)$Cluster

plot_cell_clusters(my_cds)


We'll use the Adjusted Rand Index (ARI) to compare the clustering results. The adjustedRand() function in the clues package calculates the Rand Index (RI) and different variations of the ARI (HA, MA, and FM).

library(clues)
Rand        HA        MA        FM   Jaccard
0.9047255 0.3825039 0.3833038 0.4349620 0.2774384


I wrote about why there is a difference between the RI and ARI. The ARI (HA) suggests that the clustering of cells is quite different when including different numbers of principal components. I also compared clustering results with num_dim = 4 and num_dim = 5 and while the ARI is higher than the comparison between num_dim = 5 and num_dim = 10, it still implies that the cell clusters are quite different.

# include 4 dimensions
my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 4,
reduction_method = 'tSNE', verbose = TRUE)
my_cds <- clusterCells(my_cds, num_clusters = 15)
my_cluster_dim_4 <- pData(my_cds)$Cluster # include 10 dimensions my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 5, reduction_method = 'tSNE', verbose = TRUE) my_cds <- clusterCells(my_cds, num_clusters = 15) my_cluster_dim_10 <- pData(my_cds)$Cluster

# compare clustering results using 4 and 5 principal components
Rand        HA        MA        FM   Jaccard
0.9244489 0.5236041 0.5242011 0.5650728 0.3936819


For the rest of the analysis, I will be using num_dim = 10 and num_clusters = 15.

# include 10 dimensions
my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 10,
reduction_method = 'tSNE', verbose = TRUE)
my_cds <- clusterCells(my_cds, num_clusters = 15)


## Differential expression

One of the main interests of single cell transcriptomics is the identification of novel markers for cell types. We can use differentialGeneTest(), which is quite a flexible function, to find marker genes by assessing each gene's expression level across the cells. The model formulae used in the function can use any column in the phenoData table, including columns that we manually add. Here's the description of how it works from the Monocle tutorial page:

Monocle's differential expression analysis works essentially by fitting two models to the expression values for each gene, working through each gene independently. The first of the two models is called the full model. This model is essentially a way of predicting the expression value of each gene in a given cell knowing only whether that cell is a fibroblast or a myoblast. The second model, called the reduced model, does the same thing, but it doesn't know the CellType for each cell. It has to come up with a reasonable prediction of the expression value for the gene that will be used for all the cells. Because the full model has more information about each cell, it will do a better job of predicting the expression of the gene in each cell. The question Monocle must answer for each gene is how much better the full model's prediction is than the reduced model's. The greater the improvement that comes from knowing the CellType of each cell, the more significant the differential expression result.

I'll create a vector that indicates whether a single was clustered in "Cluster 1" or not to identify genes differentially expressed in cluster 1 versus the other clusters.

# create vector of no's
my_vector <- rep('no', nrow(pData(my_cds)))

# change status to yes if the cell was in cluster 1
my_vector[pData(my_cds)$Cluster == 1] <- rep('yes', sum(pData(my_cds)$Cluster == 1))

pData(my_cds)$test <- my_vector head(pData(my_cds)) barcode Size_Factor num_genes_expressed Cluster peaks halo delta rho test my_colour AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1 0.5672842 871 7 FALSE FALSE 0.1914281 174.08627 no FALSE AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 0.4014116 806 6 FALSE TRUE 0.1170417 142.70796 no TRUE AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1 1.0710628 1316 3 FALSE FALSE 0.9993782 149.51023 no FALSE AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1 0.6606467 898 12 FALSE TRUE 0.3166583 153.24806 no FALSE AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 1.1058961 1526 1 FALSE TRUE 0.1818162 91.74234 yes TRUE AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 1.0521060 1495 4 FALSE TRUE 0.7871118 117.33641 no TRUE  I'll perform the differential expression analysis on the subset of genes where the mean expression (across cells) was >= 0.1 (see section "Clustering cells without marker genes"). length(unsup_clustering_genes$gene_id)
[1] 4012

de_cluster_one <- differentialGeneTest(my_cds[unsup_clustering_genes$gene_id,], fullModelFormulaStr = '~test', cores = 8) # not sure why we have results only for 479 genes out of 4,012 dim(de_cluster_one) [1] 479 8 # order by q-value library(dplyr) de_cluster_one %>% arrange(qval) %>% head() status family pval qval id gene_short_name num_cells_expressed use_for_ordering 1 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000163131 CTSS 5547 TRUE 2 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000196126 HLA-DRB1 5303 TRUE 3 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000203747 FCGR3A 951 TRUE 4 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000204287 HLA-DRA 6195 TRUE 5 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000227507 LTB 6720 TRUE 6 OK negbinomial.size 3.219336e-244 2.483842e-242 ENSG00000182287 AP1S2 3697 TRUE  Let's check out CTSS (ENSG00000163131), which is the most statistically significant gene, using plot_genes_fitter(). plot_genes_jitter(my_cds['ENSG00000163131',], grouping = "Cluster")  CTSS is specific for cluster 1 but also for clusters 4 and 6. We count data hence the expression is on a discrete scale. We can use plot_cell_clusters() to highlight clusters 1, 4, and 6. # add another column to the phenotypic data # where TRUE means cluster membership in clusters 1, 4, or 6 pData(my_cds)$my_colour <- pData(my_cds)$Cluster == 1 | pData(my_cds)$Cluster == 4 | pData(my_cds)$Cluster == 6 plot_cell_clusters(my_cds, color_by = 'my_colour')  CTSS is overexpressed in clusters 1, 4, and 6, which are highlighted in turquoise. ## Constructing Single Cell Trajectories In Monocle, a single cell trajectory is the inferred developmental timeline of single cells. From the tutorial: Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process. Once it has learned the overall "trajectory" of gene expression changes, Monocle can place each cell at its proper position in the trajectory. The "unit" used for the trajectory is pseudotime; a cell at the beginning of the trajectory, i.e. starting state, will have a lower pseudotime than cells at the end of the trajectory, i.e. end state. Specifically, it is: the distance between a cell and the start of the trajectory, measured along the shortest path. The trajectory's total length is defined in terms of the total amount of transcriptional change that a cell undergoes as it moves from the starting state to the end state. The trajectory analysis consists of three stages: 1. Choose genes that define progress 2. Reduce the dimensionality of the data 3. Order cells in pseudotime For the trajectory analysis in this post, I will use only a subset of all cells (cluster 1, 4, and 6) and genes that are expressed in at least 10 cells. expressed_genes <- row.names(subset(fData(my_cds), num_cells_expressed >= 10)) # my_colour is TRUE if a cell belongs to either cluster 1, 4, or 6 my_cds_subset <- my_cds[expressed_genes, pData(my_cds)$my_colour]

# 15,446 genes and 2,060 cells
my_cds_subset
CellDataSet (storageMode: environment)
assayData: 15446 features, 2060 samples
element names: exprs
protocolData: none
phenoData
sampleNames: AAACCTGAGCTAACTC-1 AAACCTGCACTGTTAG-1 ... TTTGTCATCGTCTGAA-1 (2060 total)
varLabels: barcode Size_Factor ... my_colour (10 total)
featureData
featureNames: ENSG00000238009 ENSG00000279457 ... ENSG00000271254 (15446 total)
fvarLabels: id gene_short_name num_cells_expressed use_for_ordering
experimentData: use 'experimentData(object)'
Annotation:


We'll use the recommended approach of ordering based on genes that differ between clusters. First, we'll perform another subset of genes, keeping only genes expressed in greater than 5% of cells.

my_cds_subset <- detectGenes(my_cds_subset, min_expr = 0.1)
fData(my_cds_subset)$use_for_ordering <- fData(my_cds_subset)$num_cells_expressed > 0.05 * ncol(my_cds_subset)

# how many genes are used?
table(fData(my_cds_subset)$use_for_ordering) FALSE TRUE 9141 6305 plot_pc_variance_explained(my_cds_subset, return_all = FALSE) Warning in (function (A, nv = 5, nu = nv, maxit = 100, work = nv + 7, reorth = TRUE, : did not converge--results might be invlaid!; try increasing maxit or fastpath=FALSE  There's the same warning message as before when running plot_pc_variance_explained(), so I'll just use num_dim = 10 and perform clustering as before (see section "Clustering cells without marker genes") but this time without specifying the number of clusters; we will use thresholds on the cell's local density (rho) and nearest distance (delta) to determine the number of clusters. my_cds_subset <- reduceDimension(my_cds_subset, max_components = 2, norm_method = 'log', num_dim = 10, reduction_method = 'tSNE', verbose = TRUE) my_cds_subset <- clusterCells(my_cds_subset, verbose = FALSE) Distance cutoff calculated to 3.551792 Warning messages: 1: In if (method == "DDRTree") { : the condition has length > 1 and only the first element will be used 2: In if (method == "densityPeak") { : the condition has length > 1 and only the first element will be used plot_rho_delta(my_cds_subset, rho_threshold = 2, delta_threshold = 10)  Scatter plot of rho versus delta. We'll use rho = 2 and delta = 10 to cluster the cells again. my_cds_subset <- clusterCells(my_cds_subset, rho_threshold = 2, delta_threshold = 10, skip_rho_sigma = T, verbose = FALSE) Warning messages: 1: In if (method == "DDRTree") { : the condition has length > 1 and only the first element will be used 2: In if (method == "densityPeak") { : the condition has length > 1 and only the first element will be used table(pData(my_cds_subset)$Cluster)

1    2    3    4
229  405  175 1251

plot_cell_clusters(my_cds_subset)


Now we'll perform the differential gene expression analysis as before but across all cell clusters.

clustering_DEG_genes <- differentialGeneTest(my_cds_subset,
fullModelFormulaStr = '~Cluster',
cores = 8)
Warning messages:
1: In inherits(x, "ordered") :
closing unused connection 10 (<-localhost:11278)
2: In inherits(x, "ordered") :
closing unused connection 9 (<-localhost:11278)
3: In inherits(x, "ordered") :
closing unused connection 8 (<-localhost:11278)
4: In inherits(x, "ordered") :
closing unused connection 7 (<-localhost:11278)
5: In inherits(x, "ordered") :
closing unused connection 6 (<-localhost:11278)
6: In inherits(x, "ordered") :
closing unused connection 5 (<-localhost:11278)
7: In inherits(x, "ordered") :
closing unused connection 4 (<-localhost:11278)
8: In inherits(x, "ordered") :
closing unused connection 3 (<-localhost:11278)

dim(clustering_DEG_genes)
[1] 15446     8

status           family          pval          qval              id gene_short_name num_cells_expressed use_for_ordering
1     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000143546          S100A8                2034             TRUE
2     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000163220          S100A9                2049             TRUE
3     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000163221         S100A12                1740             TRUE
4     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000203747          FCGR3A                 358             TRUE
5     OK negbinomial.size 1.106361e-319 3.417771e-316 ENSG00000231389        HLA-DPA1                1671             TRUE
6     OK negbinomial.size 2.237553e-312 5.760207e-309 ENSG00000090382             LYZ                2057             TRUE


We'll use the top 1,000 most significantly differentially expressed genes as the set of ordering genes and perform the dimension reduction and the trajectory analysis (using the orderCells() function).

my_ordering_genes <- row.names(clustering_DEG_genes)[order(clustering_DEG_genes$qval)][1:1000] my_cds_subset <- setOrderingFilter(my_cds_subset, ordering_genes = my_ordering_genes) my_cds_subset <- reduceDimension(my_cds_subset, method = 'DDRTree') # the warnings were for use of deprecated code my_cds_subset <- orderCells(my_cds_subset) There were 50 or more warnings (use warnings() to see the first 50) plot_cell_trajectory(my_cds_subset, color_by = "Cluster")  ## Finding Genes that Change as a Function of Pseudotime Once we have a trajectory, we can use differentialGeneTest() to find genes that have an expression pattern that varies according to pseudotime. # pseudotime is now a column in the phenotypic data as well as the cell state head(pData(my_cds_subset)) barcode Size_Factor num_genes_expressed Cluster peaks halo delta rho test my_colour AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 0.4014116 806 4 FALSE TRUE 0.5737701 35.64480 no TRUE AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 1.1058961 1523 2 FALSE TRUE 0.7596041 38.15302 yes TRUE AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 1.0521060 1495 4 FALSE TRUE 0.5180008 33.71503 no TRUE AAACCTGGTAGAAGGA-1 AAACCTGGTAGAAGGA-1 1.3016257 1628 2 FALSE TRUE 0.4064823 34.54570 yes TRUE AAACCTGGTTTGCATG-1 AAACCTGGTTTGCATG-1 1.5070707 1899 2 FALSE TRUE 0.3787114 35.82955 yes TRUE AAACCTGTCGTGGACC-1 AAACCTGTCGTGGACC-1 1.0878871 1321 3 FALSE FALSE 0.6943011 37.86991 yes TRUE Pseudotime State AAACCTGAGCTAACTC-1 17.595545 5 AAACCTGCACTGTTAG-1 6.930170 1 AAACCTGCATAGTAAG-1 15.136582 5 AAACCTGGTAGAAGGA-1 4.916377 1 AAACCTGGTTTGCATG-1 5.139131 1 AAACCTGTCGTGGACC-1 12.394433 4 my_pseudotime_de <- differentialGeneTest(my_cds_subset, fullModelFormulaStr = "~sm.ns(Pseudotime)", cores = 8) my_pseudotime_de %>% arrange(qval) %>% head() status family pval qval id gene_short_name num_cells_expressed use_for_ordering 1 OK negbinomial.size 0 0 ENSG00000143546 S100A8 2034 TRUE 2 OK negbinomial.size 0 0 ENSG00000163220 S100A9 2049 TRUE 3 OK negbinomial.size 0 0 ENSG00000163221 S100A12 1740 TRUE 4 OK negbinomial.size 0 0 ENSG00000203747 FCGR3A 358 TRUE 5 OK negbinomial.size 0 0 ENSG00000231389 HLA-DPA1 1671 TRUE 6 OK negbinomial.size 0 0 ENSG00000257764 RP11-1143G9.4 1896 TRUE # save the top 6 genes my_pseudotime_de %>% arrange(qval) %>% head() %>% select(id) -> my_pseudotime_gene my_pseudotime_gene <- my_pseudotime_gene$id

plot_genes_in_pseudotime(my_cds_subset[my_pseudotime_gene,])


## Clustering Genes by Pseudotemporal Expression Pattern

Monocle provides functionality for clustering genes according to their pseudotime value. The clustering analysis and plotting are done in one step using the plot_pseudotime_heatmap() function; to save the clustering results, use return_heatmap = TRUE.

# cluster the top 50 genes that vary as a function of pseudotime
my_pseudotime_de %>% arrange(qval) %>% head(50) %>% select(id) -> gene_to_cluster
gene_to_cluster <- gene_to_cluster$id my_pseudotime_cluster <- plot_pseudotime_heatmap(my_cds_subset[gene_to_cluster,], num_clusters = 3, cores = 8, show_rownames = TRUE, return_heatmap = TRUE)  The columns of the heatmap are pseudotime values binned into 100 bins. Here's the source code: newdata <- data.frame(Pseudotime = seq(min(pData(my_cds_subset)$Pseudotime),
max(pData(my_cds_subset)$Pseudotime), length.out = 100))  Since we saved the results in my_pseudotime_cluster, we can extract the genes for each cluster. # hierarchical clustering was used to cluster the genes # we can cut the dendrogram to form the same 3 clusters as plot_pseudotime_heatmap my_cluster <- cutree(my_pseudotime_cluster$tree_row, 3)
my_cluster
ENSG00000143546 ENSG00000163220 ENSG00000163221 ENSG00000203747 ENSG00000231389 ENSG00000257764 ENSG00000163737 ENSG00000155366
1               1               1               2               3               1               2               2
ENSG00000163736 ENSG00000090382 ENSG00000038427 ENSG00000223865 ENSG00000019582 ENSG00000169429 ENSG00000196735 ENSG00000142089
2               1               1               3               3               1               3               2
ENSG00000187109 ENSG00000185201 ENSG00000168497 ENSG00000166927 ENSG00000129757 ENSG00000188290 ENSG00000197353 ENSG00000163563
2               2               2               2               2               2               2               1
ENSG00000123416 ENSG00000127920 ENSG00000180573 ENSG00000214548 ENSG00000101162 ENSG00000196126 ENSG00000170345 ENSG00000103187
2               2               2               2               2               3               1               2
ENSG00000204525 ENSG00000130066 ENSG00000130522 ENSG00000158517 ENSG00000125898 ENSG00000108798 ENSG00000065978 ENSG00000173372
3               2               1               1               2               2               2               2
ENSG00000170873 ENSG00000170458 ENSG00000212907 ENSG00000234745 ENSG00000204287 ENSG00000171611 ENSG00000132965 ENSG00000162444
2               1               1               3               3               2               1               1
ENSG00000105372 ENSG00000148737
2               2

# genes in cluster 1
my_pseudotime_de[names(my_cluster[my_cluster == 1]),"gene_short_name"]
[1] "S100A8"        "S100A9"        "S100A12"       "RP11-1143G9.4" "LYZ"           "VCAN"          "CXCL8"
[8] "MNDA"          "FOS"           "JUND"          "NCF1"          "CD14"          "MT-ND4L"       "ALOX5AP"
[15] "RBP7"

# genes in cluster 2
my_pseudotime_de[names(my_cluster[my_cluster == 2]),"gene_short_name"]
[1] "FCGR3A"    "PF4"       "RHOC"      "PPBP"      "IFITM3"    "NAP1L1"    "IFITM2"    "SDPR"      "MS4A7"     "CDKN1C"
[11] "HES4"      "LYPD2"     "TUBA1B"    "GNG11"     "HIST1H2AC" "MEG3"      "TUBB1"     "COTL1"     "SAT1"      "FAM110A"
[21] "ABI3"      "YBX1"      "C1QA"      "MTSS1"     "PTCRA"     "RPS19"     "TCF7L2"

# genes in cluster 3
my_pseudotime_de[names(my_cluster[my_cluster == 3]),"gene_short_name"]
[1] "HLA-DPA1" "HLA-DPB1" "CD74"     "HLA-DQA1" "HLA-DRB1" "HLA-C"    "HLA-B"    "HLA-DRA"


## Analyzing Branches in Single-Cell Trajectories

In our trajectory, we have two branches, which represents cells that have alternative gene expression patterns. These represent cells that have supposedly gone through different developmental paths. Monocle provides functions that allows you to identify the genes that differ at a particular branch point. Here is the trajectory again.

plot_cell_trajectory(my_cds_subset, color_by = "Cluster")


The BEAM() function takes a CellDataSet that has been ordered with orderCells() and a branch point in the trajectory. A table of genes is returned with significance values that indicate whether genes have expression patterns that are branch dependent.

# warnings not shown
BEAM_res <- BEAM(my_cds_subset, branch_point = 1, cores = 8)
BEAM_res <- BEAM_res[order(BEAM_res$qval),] BEAM_res <- BEAM_res[,c("gene_short_name", "pval", "qval")] # check out the results head(BEAM_res) gene_short_name pval qval ENSG00000143546 S100A8 5.715373e-202 8.388453e-198 ENSG00000163220 S100A9 2.170305e-139 1.592678e-135 ENSG00000163221 S100A12 3.222573e-113 1.576590e-109 ENSG00000223865 HLA-DPB1 4.574242e-93 1.678404e-89 ENSG00000196735 HLA-DQA1 6.264157e-78 1.838781e-74 ENSG00000196126 HLA-DRB1 2.437269e-77 5.961965e-74 table(BEAM_res$qval < 1e-4)

FALSE  TRUE
15230   216

my_branched_heatmap <- plot_genes_branched_heatmap(my_cds_subset[row.names(subset(BEAM_res, qval < 1e-4)),],
branch_point = 1,
num_clusters = 4,
cores = 8,
use_gene_short_name = TRUE,
show_rownames = TRUE,
return_heatmap = TRUE)


The heatmap shows how some genes are over-expressed or under-expressed depending on the trajectory path.

We can return genes that belong to specific clusters that were identified by BEAM().

head(my_branched_heatmap$annotation_row) Cluster S100A8 1 S100A9 1 S100A12 1 HLA-DPB1 2 HLA-DQA1 2 HLA-DRB1 2 dim(my_branched_heatmap$annotation_row)
[1] 216   1

table(my_branched_heatmap$annotation_row$Cluster)

1  2  3  4
67 32 87 30

my_row <- my_branched_heatmap$annotation_row my_row <- data.frame(cluster = my_row$Cluster,
gene = row.names(my_row),
stringsAsFactors = FALSE)

head(my_row[my_row$cluster == 3,'gene']) [1] "RPL3" "RPS12" "RPL5" "EEF1A1" "RPSA" "RPS6" my_gene <- row.names(subset(fData(my_cds_subset), gene_short_name %in% head(my_row[my_row$cluster == 3,'gene'])))

# plot genes that are expressed in a branch dependent manner
plot_genes_branched_pseudotime(my_cds_subset[my_gene,],
branch_point = 1,
ncol = 1)


The trend lines show the expression pattern of genes along the paths formed by branch point 1.

.
1. Kenny says:

Thanks for your article,I have a question about how to confirm the number of cluster and dim, it is so subjective.Thank you

2. Zoey says:

Thank you so much Mr Tang! I have learned a lot from your blog to use monocle, seurat, intersect and so on. Your blog is soooo helpful ! Thanks again!

3. Atul Singha says:

Mr. Tang, Thank you very much!
I am trying to follow your blog but I noticed little bit problem.
Could you suggest me in this regard ?
—– plot_pc_variance_explained(my_cds, return_all = F) # norm_method=’log’
Problem is the above command is not executing .

> plot_pc_variance_explained(my_cds, return_all = F)
Error in (function (A, nv = 5, nu = nv, maxit = 100, work = nv + 7, reorth = TRUE, :
BLAS/LAPACK routine ‘DLASCL’ gave error code -4

I don’t know why.??

1. shaorui says:

the package of irlba or sth. related not installed complitely
for irlba, a matrix computing package, you need to install the lastest github version ,then monocle can use it

4. Lucy says:

Hi Dave,

Do you know I would go about doing differential expression analysis between two samples and not between clusters?

Cheers

1. Lucy says:

Thank you! Will give it a go.
Cheers,
Lucy

5. JiaruiMi says:

Hi, Dave

I really appreciated your awesome tutorial. I have one question about plot_cell_clusters. It seems that the arguments in this function such as markers, cell_size have been deprecated. I followed the instruction in Monocle2 website, but I can not make a plot like they did. For example, the authors can map the size of the dots to the log expression value and map the color of the dots according to CellType, in the mean time showing two markers in two separate plots. My R version is 3.8 and my monocle version is 2.8.0. In your tutorial, you only add the CellDataSet object into the function without any other arguments. I am not sure whether it is because you have the same problem or not.

Thanks, hope to hear from you soon.

Jiarui

1. Gianluca says:

Great post!

What about if u have cases and controls and u want to follow this protocol ?

6. Benjy says:

Hi Dave,

Thank you for your tutorial on Monocle. It’s very useful.
I don’t quite understand the meaning or use of rho and delta when you do the clustering.
Could you explain it a bit more for me?

Thank you very much.

7. Tara says:

Thank you so much for this tutorial! Very helpful! Above you said “We could remove the cells with much higher gene (and UMI) counts as they might be doublets but I’ll include them in this analysis.”

I’d like to remove cells with very low UMI counts and very high UMI counts, to remove dead cells and doubles. With Seurat, I exclude less than 2000 and above 6000. How do you do this in monocle?

Thanks!

8. guanxn says:

Under Constructing Single Cell Trajectories section:
This line of code does not run:
my_cds_subset <- reduceDimension(my_cds_subset,
max_components = 2,
norm_method = 'log',
num_dim = 10,
reduction_method = 'tSNE',
verbose = TRUE)
if my_cds_subset is generated by: my_cds_subset <- my_cds[expressed_genes, pData(my_cds)$my_colour]; then the error is: Error in $<-.data.frame(*tmp*, "tsne_1", value = c(-4.94368917177606, : replacement has 8381 rows, data has 2425. I then manually re-level all the factor in pData(my_cds_subset) it still shows this error.

Next I tried to save the cluster 1 and 3 cell barcodes, and re-build the my_cds_subset monocle object from the raw matrix (subsetting with the barcode I saved to generate the exprs matrix and pData). The new object is 15446 features, 2425 samples ; next I rerun the whole monocle pipeline, and when hitting the reduceDimensions() step, the error is now always: Error in .check_tsne_params(nrow(X), dims = dims, perplexity = perplexity, : perplexity is too large for the number of samples
I changed the perplexity to 1, still show the error..

Any suggestion? or Any one has reached beyond the reduceDimension() on my_cds_subset?
Thanks a lot.

1. Haoran Li says:

I found there is a person answer :perplexity roughly equal to 5-50.

9. Gwang-Jin Kim says:

Dear Dave and all,
the code is old.
And load_cellranger_matrix()does strange stuff (like adding some subfolder names and looking there). This was quite annoying. So I went to github site of cellrangerRkit, looked into code, and created a hacked version of load_cellranger_matrix():

“
summary_fn=NULL) {
mat_fn <- file.path(data_dir, "matrix.mtx")
gene_fn <- file.path(data_dir, "genes.tsv")
barcode_fn <- file.path(data_dir, "barcodes.tsv")

if (dim(mat)[1] != dim(gene_info)[1]) {
stop(sprintf("Mismatch dimension between gene file: \n\t %s\n and matrix file: \n\t %s\n",
gene_fn,
mat_fn))
}
if (dim(mat)[2] != dim(barcodes)[1]) {
stop(sprintf("Mismatch dimension between gene file: \n\t %s\n and matrix file: \n\t %s\n",
barcode_fn,
mat_fn))
}
rownames(mat) <- gene_info[, 1]
colnames(mat) <- barcodes[, 1]
gene_symbols <- gene_info
row.names(gene_symbols) <- gene_symbols[, 1]
colnames(gene_symbols) <- c("id", "symbols")
pd <- data.frame(id=barcodes[,1], row.names=barcodes[,1])
colnames(pd) <- c("barcode")

res <- newGeneBCMatrix(mat=mat, fd=gene_symbols, pd=pd)
res@subsampled <- FALSE
if (is.null(summary_fn)) {
warning("No summary file provided. Some functions may be disabled without the metrics in summary csv.\n")
} else {
res@summary <- summary
}
res
}
“

With this function,
“