Monocle is an R package developed for analysing single cell gene expression data. Specifically, the package provides functionality for clustering and classifying single cells, conducting differential expression analyses, and constructing and investigating inferred developmental trajectories. The toolkit provides various alternative approaches for each analysis, hence your workflow may differ from the approach I've taken in this post.
I will use an open dataset provided by 10x Genomics, which was generated from Peripheral Blood Mononuclear Cells (PBMCs) from a healthy donor. PBMCs are primary cells, which are cells isolated directly from tissue using enzymatic or mechanical methods, with relatively small amounts of RNA (around 1pg RNA per cell). The data was sequenced on an Illumina HiSeq 4000 with approximately 92,000 reads per 8,381 cells. This dataset was processed with Cell Ranger 2.0.1 with the parameter --cells=10000.
To get started install Monocle from Bioconductor.
source("http://bioconductor.org/biocLite.R") biocLite("monocle") # load library library(monocle) # version used for this post packageVersion('monocle') [1] ‘2.4.0’
Now download the processed dataset (around 37M).
wget -c http://cf.10xgenomics.com/samples/cell-exp/2.0.1/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz # on OS X it's md5 # on other Linux distros it is typically md5sum md5 pbmc8k_filtered_gene_bc_matrices.tar.gz MD5 (pbmc8k_filtered_gene_bc_matrices.tar.gz) = 1ddd33d2b3de26b6d058fba089a72614
We are going to use the Cell Ranger R Kit package to read the 10X single cell data into R. Install it if you haven't already.
# see http://cf.10xgenomics.com/supp/cell-exp/cellrangerrkit-PBMC-vignette-knitr-2.0.0.pdf source("http://cf.10xgenomics.com/supp/cell-exp/rkit-install-2.0.0.R") # load library library(cellrangerRkit) # version used for this post packageVersion("cellrangerRkit") [1] ‘2.0.0’
The load_cellranger_matrix() function in the cellrangerRkit is used to load the downloaded single cell data into R. However, the function expects a certain directory structure, so we need to move the filtered_gene_bc_matrices directory into an outs directory.
# extract tar -xzf pbmc8k_filtered_gene_bc_matrices.tar.gz mkdir outs mv filtered_gene_bc_matrices/ outs/
Now we're ready to load the pbmc8k data into R. Note that I have set the variable my_dir to the directory of where the downloaded dataset is located; modify this accordingly.
# change this according to where you downloaded and extracted the data my_dir <- "~/muse/monocle/pbmc8k/" # load data gbm <- load_cellranger_matrix(my_dir) Searching for genomes in: ~/muse/monocle/pbmc8k//outs/filtered_gene_bc_matrices Using GRCh38 in folder: ~/muse/monocle/pbmc8k//outs/filtered_gene_bc_matrices/GRCh38 Loaded matrix information Loaded gene information Loaded barcode information Could not find summary csv: ~/muse/monocle/pbmc8k//outs/metrics_summary.csv. This file is only necessary if you are performing depth-normalization (calling the equalize_gbms function) in R. If this pipestance was produced by `cellranger aggr` with the default parameters, depth-normalization in R (via equalize_gbms) is not necessary. # what class is the object? class(gbm) [1] "GeneBCMatrix" attr(,"package") [1] "cellrangerRkit"
Conveniently, we can use the exprs(), pData(), and fData() functions to retrieve the expression, experimental phenotypes (phenoData), and feature information, respectively, from the gbm object. These functions are from the Bioconductor ExpressionSet class, which is handy because the CellDataSet class used in Monocle is also derived from this class.
# check out the expression data using exprs() # 33,694 genes across 8,381 single cells dim(exprs(gbm)) [1] 33694 8381 # nothing exciting going on in the first five genes in the first five cells exprs(gbm)[1:5, 1:5] 5 x 5 sparse Matrix of class "dgTMatrix" AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 AAACCTGAGCTAGTGG-1 AAACCTGCACATTAGC-1 AAACCTGCACTGTTAG-1 ENSG00000243485 . . . . . ENSG00000237613 . . . . . ENSG00000186092 . . . . . ENSG00000238009 . . . . . ENSG00000239945 . . . . . # check out the phenotypic data dim(pData(gbm)) [1] 8381 1 head(pData(gbm)) barcode AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1 AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1 AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 # check out the feature information dim(fData(gbm)) [1] 33694 2 # this table will be useful for matching gene IDs to symbols head(fData(gbm)) id symbol ENSG00000243485 ENSG00000243485 RP11-34P13.3 ENSG00000237613 ENSG00000237613 FAM138A ENSG00000186092 ENSG00000186092 OR4F5 ENSG00000238009 ENSG00000238009 RP11-34P13.7 ENSG00000239945 ENSG00000239945 RP11-34P13.8 ENSG00000239906 ENSG00000239906 RP11-34P13.14
The main object class in Monocle is the CellDataSet object; to get started we need to create a CellDataSet object with the newCellDataSet() function. The CellDataSet object was derived from the ExpressionSet class, so it's easy to create, since the gbm object was also derived from the same class. However, Monocle expects that the gene symbol column in the feature data is called gene_short_name or else you will get a warning.
# warning! my_cds <- newCellDataSet(exprs(gbm), phenoData = new("AnnotatedDataFrame", data = pData(gbm)), featureData = new("AnnotatedDataFrame", data = fData(gbm)), lowerDetectionLimit = 0.5, expressionFamily = negbinomial.size()) Warning message: In newCellDataSet(exprs(gbm), phenoData = new("AnnotatedDataFrame", : None of your featureData columns are named 'gene_short_name', some functions will not be able to take this function as input as a result # rename gene symbol column my_feat <- fData(gbm) names(my_feat) <- c('id', 'gene_short_name') # no warning my_cds <- newCellDataSet(exprs(gbm), phenoData = new("AnnotatedDataFrame", data = pData(gbm)), featureData = new("AnnotatedDataFrame", data = my_feat), lowerDetectionLimit = 0.5, expressionFamily = negbinomial.size()) my_cds CellDataSet (storageMode: environment) assayData: 33694 features, 8381 samples element names: exprs protocolData: none phenoData sampleNames: AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 ... TTTGTCATCTGCTTGC-1 (8381 total) varLabels: barcode Size_Factor varMetadata: labelDescription featureData featureNames: ENSG00000243485 ENSG00000237613 ... ENSG00000268674 (33694 total) fvarLabels: id gene_short_name fvarMetadata: labelDescription experimentData: use 'experimentData(object)' Annotation: # see ?CellDataSet for more information slotNames(my_cds) [1] "reducedDimS" "reducedDimW" "reducedDimA" "reducedDimK" [5] "minSpanningTree" "cellPairwiseDistances" "expressionFamily" "lowerDetectionLimit" [9] "dispFitInfo" "dim_reduce_type" "auxOrderingData" "auxClusteringData" [13] "experimentData" "assayData" "phenoData" "featureData" [17] "annotation" "protocolData" ".__classVersion__"
It is important that the appropriate expression family is used when creating a new CellDataSet. We have UMI data (that is quite large), so we'll use negbinomial.size(). If your data contains relative counts (e.g. FPKM or TPM values), you can use relative2abs() to convert these measurements into absolute counts.
Next, we'll perform some normalisation and variance estimation steps, which will be used in the differential expression analyses later on.
my_cds <- estimateSizeFactors(my_cds) my_cds <- estimateDispersions(my_cds) Removing 146 outliers Warning message: Deprecated, use tibble::rownames_to_column() instead.
Now let's explore the data! First we will need to run the detectGenes() function, which tallies the number of cells expressing a gene and the number of genes expressed among all cells. This will be clearer when you see the tables.
my_cds <- detectGenes(my_cds, min_expr = 0.1) head(fData(my_cds)) id gene_short_name num_cells_expressed ENSG00000243485 ENSG00000243485 RP11-34P13.3 0 ENSG00000237613 ENSG00000237613 FAM138A 0 ENSG00000186092 ENSG00000186092 OR4F5 0 ENSG00000238009 ENSG00000238009 RP11-34P13.7 24 ENSG00000239945 ENSG00000239945 RP11-34P13.8 1 ENSG00000239906 ENSG00000239906 RP11-34P13.14 0 summary(fData(my_cds)$num_cells_expressed) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0 0.0 5.0 349.9 234.0 8381.0
The num_cells_expressed column is a tally of the number of cells expressing a particular gene (a gene is "expressed" if there is at least one count since we set min_expr = 0.1). For example if we tally ENSG00000239945 across all cells, we should get 1, as indicated in the table above.
sum((exprs(my_cds['ENSG00000239945',]))) [1] 1 sum((exprs(my_cds['ENSG00000238009',]))) [1] 24
The number of genes expressed (num_genes_expressed) per cell is stored in phenoData. Note that if a gene has 10 UMIs or 1 UMI, it is still tallied as 1.
head(pData(my_cds)) barcode Size_Factor num_genes_expressed AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1 0.5672842 871 AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 0.4014116 806 AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1 1.0710628 1316 AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1 0.6606467 898 AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 1.1058961 1526 AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 1.0521060 1495 # sanity check for AAACCTGAGCATCATC-1 sum((exprs(my_cds)[,"AAACCTGAGCATCATC-1"])>0) [1] 871 summary(pData(my_cds)$num_genes_expressed) Min. 1st Qu. Median Mean 3rd Qu. Max. 502 1111 1297 1407 1548 4800 # standardise to Z-distribution x <- pData(my_cds)$num_genes_expressed x_1 <- (x - mean(x)) / sd(x) summary(x_1) Min. 1st Qu. Median Mean 3rd Qu. Max. -1.8839 -0.6155 -0.2282 0.0000 0.2946 7.0675 library(ggplot2) # I like the default theme of cowplot library(cowplot) df <- data.frame(x = x_1) ggplot(df, aes(x)) + geom_histogram(bins = 50) + geom_vline(xintercept = c(-2, 2), linetype = "dotted", color = 'red')
Only a few cells have an abnormal number of expressed genes, such as the ones that are 5 standard deviations away from the mean.
We can also add our own metadata, such as the UMI count, to the phenoData.
# add a UMI column into phenoData # use Matrix because data is in the sparse format pData(my_cds)$UMI <- Matrix::colSums(exprs(my_cds)) head(pData(my_cds)) barcode Size_Factor num_genes_expressed UMI AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1 0.5672842 871 2394 AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 0.4014116 806 1694 AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1 1.0710628 1316 4520 AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1 0.6606467 898 2788 AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 1.1058961 1526 4667 AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 1.0521060 1495 4440 ggplot(pData(my_cds), aes(num_genes_expressed, UMI)) + geom_point()
We could remove the cells with much higher gene (and UMI) counts as they might be doublets but I'll include them in this analysis.
Clustering cells without marker genes
Monocle provides several alternative approaches for clustering cells and the recommended approach is to use marker genes. However, since I don't know of any specific markers for this dataset I will use the unsupervised approach. The first step is determining a subset of genes to use for clustering; this is because not all genes are informative, such as those that are lowly expressed. The approach is to select gene based on their average expression and variability across cells. The dispersionTable() function calculates the mean and dispersion values.
disp_table <- dispersionTable(my_cds) head(disp_table) gene_id mean_expression dispersion_fit dispersion_empirical 1 ENSG00000238009 0.0026983425 52.863362 0.000000000 2 ENSG00000279457 0.1548382003 1.068659 0.356959667 3 ENSG00000228463 0.0543320140 2.767983 0.005359107 4 ENSG00000237094 0.0022958750 62.104025 30.358663597 5 ENSG00000230021 0.0002830376 502.693229 0.000000000 6 ENSG00000237491 0.0245526857 5.943231 0.000000000
We will select genes, which have a mean expression >= 0.1, to use in the clustering step. The setOrderingFilter() function allows us to indicate which genes we want to use for clustering. The plot_ordering_genes() function plots mean expression against the empirical dispersion and highlights the set of genes (as black dots) that will be used for clustering.
table(disp_table$mean_expression>=0.1) FALSE TRUE 15422 4012 unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1) my_cds <- setOrderingFilter(my_cds, unsup_clustering_genes$gene_id) plot_ordering_genes(my_cds) Warning messages: 1: Transformation introduced infinite values in continuous y-axis 2: Transformation introduced infinite values in continuous y-axis
It will take a lot of computational time to cluster 8,381 cells even when we have subsetted the total number of genes down to 4,012. The usual approach is to perform dimension reduction (typically Principal Component Analysis [PCA]) and use the principal components that explain the majority of the variance in the data. The plot_pc_variance_explained() function plots the percentage of variance explained by each component based on a PCA performed on the normalised expression data. From the plot we can decide the number of components to use in our downstream analyses.
plot_pc_variance_explained(my_cds, return_all = FALSE) Warning in (function (A, nv = 5, nu = nv, maxit = 100, work = nv + 7, reorth = TRUE, : did not converge--results might be invlaid!; try increasing maxit or fastpath=FALSE
The goal is to identify the elbow in this plot, which can be a bit subjective.
The warning message is from the irlba() function, which was used to perform the PCA. The "results might be invlaid [sic]" warning is a bit worrying though, so I'm not sure whether or not the plot is valid. I'll perform dimension reduction using five and ten principal components (by setting the num_dim parameter, which is used as input to the prcomp_irlba() function) and compare the clustering results.
The clusterCells() function performs the clustering on a CellDataSet; however, you need to input the number of requested clusters (I arbitrarily chose 15). The results of the clustering are stored in the phenoData table.
# if num_dim is not set, the default is set to 50 # I'll use 5 here based on plot_pc_variance_explained my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 5, reduction_method = 'tSNE', verbose = TRUE) Remove noise by PCA ... Reduce dimension by tSNE ... # perform unsupervised clustering requesting 15-1 clusters my_cds <- clusterCells(my_cds, num_clusters = 15) Distance cutoff calculated to 4.797959 Warning messages: 1: In if (method == "DDRTree") { : the condition has length > 1 and only the first element will be used 2: In if (method == "densityPeak") { : the condition has length > 1 and only the first element will be used # cluster information is in the Cluster column head(pData(my_cds)) barcode Size_Factor num_genes_expressed Cluster peaks halo delta rho AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1 0.5672842 871 4 FALSE FALSE 0.44962230 145.2827 AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 0.4014116 806 12 FALSE TRUE 0.33860313 131.3374 AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1 1.0710628 1316 5 FALSE TRUE 0.87331834 128.1797 AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1 0.6606467 898 9 FALSE TRUE 0.36330440 123.4466 AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 1.1058961 1526 11 FALSE TRUE 0.15344369 104.5582 AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 1.0521060 1495 11 FALSE TRUE 0.07492976 119.3500 # store cluster info for comparing later my_cluster_dim_5 <- pData(my_cds)$Cluster plot_cell_clusters(my_cds)
Now with num_dim = 10.
my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 10, reduction_method = 'tSNE', verbose = TRUE) Remove noise by PCA ... Reduce dimension by tSNE ... my_cds <- clusterCells(my_cds, num_clusters = 15) Distance cutoff calculated to 4.29865 Warning messages: 1: In if (method == "DDRTree") { : the condition has length > 1 and only the first element will be used 2: In if (method == "densityPeak") { : the condition has length > 1 and only the first element will be used my_cluster_dim_10 <- pData(my_cds)$Cluster plot_cell_clusters(my_cds)
We'll use the Adjusted Rand Index (ARI) to compare the clustering results. The adjustedRand() function in the clues package calculates the Rand Index (RI) and different variations of the ARI (HA, MA, and FM).
library(clues) adjustedRand(as.numeric(my_cluster_dim_5), as.numeric(my_cluster_dim_10)) Rand HA MA FM Jaccard 0.9047255 0.3825039 0.3833038 0.4349620 0.2774384
I wrote about why there is a difference between the RI and ARI. The ARI (HA) suggests that the clustering of cells is quite different when including different numbers of principal components. I also compared clustering results with num_dim = 4 and num_dim = 5 and while the ARI is higher than the comparison between num_dim = 5 and num_dim = 10, it still implies that the cell clusters are quite different.
# include 4 dimensions my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 4, reduction_method = 'tSNE', verbose = TRUE) my_cds <- clusterCells(my_cds, num_clusters = 15) my_cluster_dim_4 <- pData(my_cds)$Cluster # include 10 dimensions my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 5, reduction_method = 'tSNE', verbose = TRUE) my_cds <- clusterCells(my_cds, num_clusters = 15) my_cluster_dim_10 <- pData(my_cds)$Cluster # compare clustering results using 4 and 5 principal components clues::adjustedRand(as.numeric(my_cluster_dim_4), as.numeric(my_cluster_dim_5)) Rand HA MA FM Jaccard 0.9244489 0.5236041 0.5242011 0.5650728 0.3936819
For the rest of the analysis, I will be using num_dim = 10 and num_clusters = 15.
# include 10 dimensions my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 10, reduction_method = 'tSNE', verbose = TRUE) my_cds <- clusterCells(my_cds, num_clusters = 15)
Differential expression
One of the main interests of single cell transcriptomics is the identification of novel markers for cell types. We can use differentialGeneTest(), which is quite a flexible function, to find marker genes by assessing each gene's expression level across the cells. The model formulae used in the function can use any column in the phenoData table, including columns that we manually add. Here's the description of how it works from the Monocle tutorial page:
Monocle's differential expression analysis works essentially by fitting two models to the expression values for each gene, working through each gene independently. The first of the two models is called the full model. This model is essentially a way of predicting the expression value of each gene in a given cell knowing only whether that cell is a fibroblast or a myoblast. The second model, called the reduced model, does the same thing, but it doesn't know the CellType for each cell. It has to come up with a reasonable prediction of the expression value for the gene that will be used for all the cells. Because the full model has more information about each cell, it will do a better job of predicting the expression of the gene in each cell. The question Monocle must answer for each gene is how much better the full model's prediction is than the reduced model's. The greater the improvement that comes from knowing the CellType of each cell, the more significant the differential expression result.
I'll create a vector that indicates whether a single was clustered in "Cluster 1" or not to identify genes differentially expressed in cluster 1 versus the other clusters.
# create vector of no's my_vector <- rep('no', nrow(pData(my_cds))) # change status to yes if the cell was in cluster 1 my_vector[pData(my_cds)$Cluster == 1] <- rep('yes', sum(pData(my_cds)$Cluster == 1)) # add vector to phenoData pData(my_cds)$test <- my_vector head(pData(my_cds)) barcode Size_Factor num_genes_expressed Cluster peaks halo delta rho test my_colour AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1 0.5672842 871 7 FALSE FALSE 0.1914281 174.08627 no FALSE AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 0.4014116 806 6 FALSE TRUE 0.1170417 142.70796 no TRUE AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1 1.0710628 1316 3 FALSE FALSE 0.9993782 149.51023 no FALSE AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1 0.6606467 898 12 FALSE TRUE 0.3166583 153.24806 no FALSE AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 1.1058961 1526 1 FALSE TRUE 0.1818162 91.74234 yes TRUE AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 1.0521060 1495 4 FALSE TRUE 0.7871118 117.33641 no TRUE
I'll perform the differential expression analysis on the subset of genes where the mean expression (across cells) was >= 0.1 (see section "Clustering cells without marker genes").
length(unsup_clustering_genes$gene_id) [1] 4012 de_cluster_one <- differentialGeneTest(my_cds[unsup_clustering_genes$gene_id,], fullModelFormulaStr = '~test', cores = 8) # not sure why we have results only for 479 genes out of 4,012 dim(de_cluster_one) [1] 479 8 # order by q-value library(dplyr) de_cluster_one %>% arrange(qval) %>% head() status family pval qval id gene_short_name num_cells_expressed use_for_ordering 1 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000163131 CTSS 5547 TRUE 2 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000196126 HLA-DRB1 5303 TRUE 3 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000203747 FCGR3A 951 TRUE 4 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000204287 HLA-DRA 6195 TRUE 5 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000227507 LTB 6720 TRUE 6 OK negbinomial.size 3.219336e-244 2.483842e-242 ENSG00000182287 AP1S2 3697 TRUE
Let's check out CTSS (ENSG00000163131), which is the most statistically significant gene, using plot_genes_fitter().
plot_genes_jitter(my_cds['ENSG00000163131',], grouping = "Cluster")
CTSS is specific for cluster 1 but also for clusters 4 and 6. We count data hence the expression is on a discrete scale.
We can use plot_cell_clusters() to highlight clusters 1, 4, and 6.
# add another column to the phenotypic data # where TRUE means cluster membership in clusters 1, 4, or 6 pData(my_cds)$my_colour <- pData(my_cds)$Cluster == 1 | pData(my_cds)$Cluster == 4 | pData(my_cds)$Cluster == 6 plot_cell_clusters(my_cds, color_by = 'my_colour')
CTSS is overexpressed in clusters 1, 4, and 6, which are highlighted in turquoise.
Constructing Single Cell Trajectories
In Monocle, a single cell trajectory is the inferred developmental timeline of single cells. From the tutorial:
Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process. Once it has learned the overall "trajectory" of gene expression changes, Monocle can place each cell at its proper position in the trajectory.
The "unit" used for the trajectory is pseudotime; a cell at the beginning of the trajectory, i.e. starting state, will have a lower pseudotime than cells at the end of the trajectory, i.e. end state. Specifically, it is:
the distance between a cell and the start of the trajectory, measured along the shortest path. The trajectory's total length is defined in terms of the total amount of transcriptional change that a cell undergoes as it moves from the starting state to the end state.
The trajectory analysis consists of three stages:
- Choose genes that define progress
- Reduce the dimensionality of the data
- Order cells in pseudotime
For the trajectory analysis in this post, I will use only a subset of all cells (cluster 1, 4, and 6) and genes that are expressed in at least 10 cells.
expressed_genes <- row.names(subset(fData(my_cds), num_cells_expressed >= 10)) # my_colour is TRUE if a cell belongs to either cluster 1, 4, or 6 my_cds_subset <- my_cds[expressed_genes, pData(my_cds)$my_colour] # 15,446 genes and 2,060 cells my_cds_subset CellDataSet (storageMode: environment) assayData: 15446 features, 2060 samples element names: exprs protocolData: none phenoData sampleNames: AAACCTGAGCTAACTC-1 AAACCTGCACTGTTAG-1 ... TTTGTCATCGTCTGAA-1 (2060 total) varLabels: barcode Size_Factor ... my_colour (10 total) varMetadata: labelDescription featureData featureNames: ENSG00000238009 ENSG00000279457 ... ENSG00000271254 (15446 total) fvarLabels: id gene_short_name num_cells_expressed use_for_ordering fvarMetadata: labelDescription experimentData: use 'experimentData(object)' Annotation:
We'll use the recommended approach of ordering based on genes that differ between clusters. First, we'll perform another subset of genes, keeping only genes expressed in greater than 5% of cells.
my_cds_subset <- detectGenes(my_cds_subset, min_expr = 0.1) fData(my_cds_subset)$use_for_ordering <- fData(my_cds_subset)$num_cells_expressed > 0.05 * ncol(my_cds_subset) # how many genes are used? table(fData(my_cds_subset)$use_for_ordering) FALSE TRUE 9141 6305 plot_pc_variance_explained(my_cds_subset, return_all = FALSE) Warning in (function (A, nv = 5, nu = nv, maxit = 100, work = nv + 7, reorth = TRUE, : did not converge--results might be invlaid!; try increasing maxit or fastpath=FALSE
There's the same warning message as before when running plot_pc_variance_explained(), so I'll just use num_dim = 10 and perform clustering as before (see section "Clustering cells without marker genes") but this time without specifying the number of clusters; we will use thresholds on the cell's local density (rho) and nearest distance (delta) to determine the number of clusters.
my_cds_subset <- reduceDimension(my_cds_subset, max_components = 2, norm_method = 'log', num_dim = 10, reduction_method = 'tSNE', verbose = TRUE) my_cds_subset <- clusterCells(my_cds_subset, verbose = FALSE) Distance cutoff calculated to 3.551792 Warning messages: 1: In if (method == "DDRTree") { : the condition has length > 1 and only the first element will be used 2: In if (method == "densityPeak") { : the condition has length > 1 and only the first element will be used plot_rho_delta(my_cds_subset, rho_threshold = 2, delta_threshold = 10)
Scatter plot of rho versus delta.
We'll use rho = 2 and delta = 10 to cluster the cells again.
my_cds_subset <- clusterCells(my_cds_subset, rho_threshold = 2, delta_threshold = 10, skip_rho_sigma = T, verbose = FALSE) Warning messages: 1: In if (method == "DDRTree") { : the condition has length > 1 and only the first element will be used 2: In if (method == "densityPeak") { : the condition has length > 1 and only the first element will be used table(pData(my_cds_subset)$Cluster) 1 2 3 4 229 405 175 1251 plot_cell_clusters(my_cds_subset)
Now we'll perform the differential gene expression analysis as before but across all cell clusters.
clustering_DEG_genes <- differentialGeneTest(my_cds_subset, fullModelFormulaStr = '~Cluster', cores = 8) Warning messages: 1: In inherits(x, "ordered") : closing unused connection 10 (<-localhost:11278) 2: In inherits(x, "ordered") : closing unused connection 9 (<-localhost:11278) 3: In inherits(x, "ordered") : closing unused connection 8 (<-localhost:11278) 4: In inherits(x, "ordered") : closing unused connection 7 (<-localhost:11278) 5: In inherits(x, "ordered") : closing unused connection 6 (<-localhost:11278) 6: In inherits(x, "ordered") : closing unused connection 5 (<-localhost:11278) 7: In inherits(x, "ordered") : closing unused connection 4 (<-localhost:11278) 8: In inherits(x, "ordered") : closing unused connection 3 (<-localhost:11278) dim(clustering_DEG_genes) [1] 15446 8 clustering_DEG_genes %>% arrange(qval) %>% head() status family pval qval id gene_short_name num_cells_expressed use_for_ordering 1 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000143546 S100A8 2034 TRUE 2 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000163220 S100A9 2049 TRUE 3 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000163221 S100A12 1740 TRUE 4 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000203747 FCGR3A 358 TRUE 5 OK negbinomial.size 1.106361e-319 3.417771e-316 ENSG00000231389 HLA-DPA1 1671 TRUE 6 OK negbinomial.size 2.237553e-312 5.760207e-309 ENSG00000090382 LYZ 2057 TRUE
We'll use the top 1,000 most significantly differentially expressed genes as the set of ordering genes and perform the dimension reduction and the trajectory analysis (using the orderCells() function).
my_ordering_genes <- row.names(clustering_DEG_genes)[order(clustering_DEG_genes$qval)][1:1000] my_cds_subset <- setOrderingFilter(my_cds_subset, ordering_genes = my_ordering_genes) my_cds_subset <- reduceDimension(my_cds_subset, method = 'DDRTree') # the warnings were for use of deprecated code my_cds_subset <- orderCells(my_cds_subset) There were 50 or more warnings (use warnings() to see the first 50) plot_cell_trajectory(my_cds_subset, color_by = "Cluster")
Finding Genes that Change as a Function of Pseudotime
Once we have a trajectory, we can use differentialGeneTest() to find genes that have an expression pattern that varies according to pseudotime.
# pseudotime is now a column in the phenotypic data as well as the cell state head(pData(my_cds_subset)) barcode Size_Factor num_genes_expressed Cluster peaks halo delta rho test my_colour AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 0.4014116 806 4 FALSE TRUE 0.5737701 35.64480 no TRUE AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 1.1058961 1523 2 FALSE TRUE 0.7596041 38.15302 yes TRUE AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 1.0521060 1495 4 FALSE TRUE 0.5180008 33.71503 no TRUE AAACCTGGTAGAAGGA-1 AAACCTGGTAGAAGGA-1 1.3016257 1628 2 FALSE TRUE 0.4064823 34.54570 yes TRUE AAACCTGGTTTGCATG-1 AAACCTGGTTTGCATG-1 1.5070707 1899 2 FALSE TRUE 0.3787114 35.82955 yes TRUE AAACCTGTCGTGGACC-1 AAACCTGTCGTGGACC-1 1.0878871 1321 3 FALSE FALSE 0.6943011 37.86991 yes TRUE Pseudotime State AAACCTGAGCTAACTC-1 17.595545 5 AAACCTGCACTGTTAG-1 6.930170 1 AAACCTGCATAGTAAG-1 15.136582 5 AAACCTGGTAGAAGGA-1 4.916377 1 AAACCTGGTTTGCATG-1 5.139131 1 AAACCTGTCGTGGACC-1 12.394433 4 my_pseudotime_de <- differentialGeneTest(my_cds_subset, fullModelFormulaStr = "~sm.ns(Pseudotime)", cores = 8) my_pseudotime_de %>% arrange(qval) %>% head() status family pval qval id gene_short_name num_cells_expressed use_for_ordering 1 OK negbinomial.size 0 0 ENSG00000143546 S100A8 2034 TRUE 2 OK negbinomial.size 0 0 ENSG00000163220 S100A9 2049 TRUE 3 OK negbinomial.size 0 0 ENSG00000163221 S100A12 1740 TRUE 4 OK negbinomial.size 0 0 ENSG00000203747 FCGR3A 358 TRUE 5 OK negbinomial.size 0 0 ENSG00000231389 HLA-DPA1 1671 TRUE 6 OK negbinomial.size 0 0 ENSG00000257764 RP11-1143G9.4 1896 TRUE # save the top 6 genes my_pseudotime_de %>% arrange(qval) %>% head() %>% select(id) -> my_pseudotime_gene my_pseudotime_gene <- my_pseudotime_gene$id plot_genes_in_pseudotime(my_cds_subset[my_pseudotime_gene,])
Clustering Genes by Pseudotemporal Expression Pattern
Monocle provides functionality for clustering genes according to their pseudotime value. The clustering analysis and plotting are done in one step using the plot_pseudotime_heatmap() function; to save the clustering results, use return_heatmap = TRUE.
# cluster the top 50 genes that vary as a function of pseudotime my_pseudotime_de %>% arrange(qval) %>% head(50) %>% select(id) -> gene_to_cluster gene_to_cluster <- gene_to_cluster$id my_pseudotime_cluster <- plot_pseudotime_heatmap(my_cds_subset[gene_to_cluster,], num_clusters = 3, cores = 8, show_rownames = TRUE, return_heatmap = TRUE)
The columns of the heatmap are pseudotime values binned into 100 bins. Here's the source code:
newdata <- data.frame(Pseudotime = seq(min(pData(my_cds_subset)$Pseudotime), max(pData(my_cds_subset)$Pseudotime), length.out = 100))
Since we saved the results in my_pseudotime_cluster, we can extract the genes for each cluster.
# hierarchical clustering was used to cluster the genes # we can cut the dendrogram to form the same 3 clusters as plot_pseudotime_heatmap my_cluster <- cutree(my_pseudotime_cluster$tree_row, 3) my_cluster ENSG00000143546 ENSG00000163220 ENSG00000163221 ENSG00000203747 ENSG00000231389 ENSG00000257764 ENSG00000163737 ENSG00000155366 1 1 1 2 3 1 2 2 ENSG00000163736 ENSG00000090382 ENSG00000038427 ENSG00000223865 ENSG00000019582 ENSG00000169429 ENSG00000196735 ENSG00000142089 2 1 1 3 3 1 3 2 ENSG00000187109 ENSG00000185201 ENSG00000168497 ENSG00000166927 ENSG00000129757 ENSG00000188290 ENSG00000197353 ENSG00000163563 2 2 2 2 2 2 2 1 ENSG00000123416 ENSG00000127920 ENSG00000180573 ENSG00000214548 ENSG00000101162 ENSG00000196126 ENSG00000170345 ENSG00000103187 2 2 2 2 2 3 1 2 ENSG00000204525 ENSG00000130066 ENSG00000130522 ENSG00000158517 ENSG00000125898 ENSG00000108798 ENSG00000065978 ENSG00000173372 3 2 1 1 2 2 2 2 ENSG00000170873 ENSG00000170458 ENSG00000212907 ENSG00000234745 ENSG00000204287 ENSG00000171611 ENSG00000132965 ENSG00000162444 2 1 1 3 3 2 1 1 ENSG00000105372 ENSG00000148737 2 2 # genes in cluster 1 my_pseudotime_de[names(my_cluster[my_cluster == 1]),"gene_short_name"] [1] "S100A8" "S100A9" "S100A12" "RP11-1143G9.4" "LYZ" "VCAN" "CXCL8" [8] "MNDA" "FOS" "JUND" "NCF1" "CD14" "MT-ND4L" "ALOX5AP" [15] "RBP7" # genes in cluster 2 my_pseudotime_de[names(my_cluster[my_cluster == 2]),"gene_short_name"] [1] "FCGR3A" "PF4" "RHOC" "PPBP" "IFITM3" "NAP1L1" "IFITM2" "SDPR" "MS4A7" "CDKN1C" [11] "HES4" "LYPD2" "TUBA1B" "GNG11" "HIST1H2AC" "MEG3" "TUBB1" "COTL1" "SAT1" "FAM110A" [21] "ABI3" "YBX1" "C1QA" "MTSS1" "PTCRA" "RPS19" "TCF7L2" # genes in cluster 3 my_pseudotime_de[names(my_cluster[my_cluster == 3]),"gene_short_name"] [1] "HLA-DPA1" "HLA-DPB1" "CD74" "HLA-DQA1" "HLA-DRB1" "HLA-C" "HLA-B" "HLA-DRA"
Analyzing Branches in Single-Cell Trajectories
In our trajectory, we have two branches, which represents cells that have alternative gene expression patterns. These represent cells that have supposedly gone through different developmental paths. Monocle provides functions that allows you to identify the genes that differ at a particular branch point. Here is the trajectory again.
plot_cell_trajectory(my_cds_subset, color_by = "Cluster")
The BEAM() function takes a CellDataSet that has been ordered with orderCells() and a branch point in the trajectory. A table of genes is returned with significance values that indicate whether genes have expression patterns that are branch dependent.
# warnings not shown BEAM_res <- BEAM(my_cds_subset, branch_point = 1, cores = 8) BEAM_res <- BEAM_res[order(BEAM_res$qval),] BEAM_res <- BEAM_res[,c("gene_short_name", "pval", "qval")] # check out the results head(BEAM_res) gene_short_name pval qval ENSG00000143546 S100A8 5.715373e-202 8.388453e-198 ENSG00000163220 S100A9 2.170305e-139 1.592678e-135 ENSG00000163221 S100A12 3.222573e-113 1.576590e-109 ENSG00000223865 HLA-DPB1 4.574242e-93 1.678404e-89 ENSG00000196735 HLA-DQA1 6.264157e-78 1.838781e-74 ENSG00000196126 HLA-DRB1 2.437269e-77 5.961965e-74 table(BEAM_res$qval < 1e-4) FALSE TRUE 15230 216 my_branched_heatmap <- plot_genes_branched_heatmap(my_cds_subset[row.names(subset(BEAM_res, qval < 1e-4)),], branch_point = 1, num_clusters = 4, cores = 8, use_gene_short_name = TRUE, show_rownames = TRUE, return_heatmap = TRUE)
The heatmap shows how some genes are over-expressed or under-expressed depending on the trajectory path.
We can return genes that belong to specific clusters that were identified by BEAM().
head(my_branched_heatmap$annotation_row) Cluster S100A8 1 S100A9 1 S100A12 1 HLA-DPB1 2 HLA-DQA1 2 HLA-DRB1 2 dim(my_branched_heatmap$annotation_row) [1] 216 1 table(my_branched_heatmap$annotation_row$Cluster) 1 2 3 4 67 32 87 30 my_row <- my_branched_heatmap$annotation_row my_row <- data.frame(cluster = my_row$Cluster, gene = row.names(my_row), stringsAsFactors = FALSE) head(my_row[my_row$cluster == 3,'gene']) [1] "RPL3" "RPS12" "RPL5" "EEF1A1" "RPSA" "RPS6" my_gene <- row.names(subset(fData(my_cds_subset), gene_short_name %in% head(my_row[my_row$cluster == 3,'gene']))) # plot genes that are expressed in a branch dependent manner plot_genes_branched_pseudotime(my_cds_subset[my_gene,], branch_point = 1, ncol = 1)
The trend lines show the expression pattern of genes along the paths formed by branch point 1.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Thanks for your article,I have a question about how to confirm the number of cluster and dim, it is so subjective.Thank you
Thank you so much Mr Tang! I have learned a lot from your blog to use monocle, seurat, intersect and so on. Your blog is soooo helpful ! Thanks again!
Mr. Tang, Thank you very much!
I am trying to follow your blog but I noticed little bit problem.
Could you suggest me in this regard ?
—– plot_pc_variance_explained(my_cds, return_all = F) # norm_method=’log’
Problem is the above command is not executing .
> plot_pc_variance_explained(my_cds, return_all = F)
Error in (function (A, nv = 5, nu = nv, maxit = 100, work = nv + 7, reorth = TRUE, :
BLAS/LAPACK routine ‘DLASCL’ gave error code -4
I don’t know why.??
the package of irlba or sth. related not installed complitely
for irlba, a matrix computing package, you need to install the lastest github version ,then monocle can use it
Hi Dave,
Do you know I would go about doing differential expression analysis between two samples and not between clusters?
Cheers
Hi Lucy,
I’ve had success using edgeR (https://bioconductor.org/packages/release/bioc/html/edgeR.html) where I just make each cell from one condition as a “replicate”. Something like this:
g <- factor(rep(c('cond1', 'cond2'), c(number_cells_cond1, number_cells_cond2)))
Cheers,
Dave
Thank you! Will give it a go.
Cheers,
Lucy
Hi, Dave
I really appreciated your awesome tutorial. I have one question about plot_cell_clusters. It seems that the arguments in this function such as markers, cell_size have been deprecated. I followed the instruction in Monocle2 website, but I can not make a plot like they did. For example, the authors can map the size of the dots to the log expression value and map the color of the dots according to CellType, in the mean time showing two markers in two separate plots. My R version is 3.8 and my monocle version is 2.8.0. In your tutorial, you only add the CellDataSet object into the function without any other arguments. I am not sure whether it is because you have the same problem or not.
Thanks, hope to hear from you soon.
Jiarui
Great post!
What about if u have cases and controls and u want to follow this protocol ?
Hi Dave,
Thank you for your tutorial on Monocle. It’s very useful.
I don’t quite understand the meaning or use of rho and delta when you do the clustering.
Could you explain it a bit more for me?
Thank you very much.
Hi Benjy,
you can find more information at http://science.sciencemag.org/content/344/6191/1492.full.
Cheers,
Dave
Thank you so much for this tutorial! Very helpful! Above you said “We could remove the cells with much higher gene (and UMI) counts as they might be doublets but I’ll include them in this analysis.”
I’d like to remove cells with very low UMI counts and very high UMI counts, to remove dead cells and doubles. With Seurat, I exclude less than 2000 and above 6000. How do you do this in monocle?
Thanks!
Under Constructing Single Cell Trajectories section:
This line of code does not run:
my_cds_subset <- reduceDimension(my_cds_subset,
max_components = 2,
norm_method = 'log',
num_dim = 10,
reduction_method = 'tSNE',
verbose = TRUE)
if my_cds_subset is generated by: my_cds_subset <- my_cds[expressed_genes, pData(my_cds)$my_colour]; then the error is: Error in `$<-.data.frame`(`*tmp*`, "tsne_1", value = c(-4.94368917177606, : replacement has 8381 rows, data has 2425. I then manually re-level all the factor in pData(my_cds_subset) it still shows this error.
Next I tried to save the cluster 1 and 3 cell barcodes, and re-build the my_cds_subset monocle object from the raw matrix (subsetting with the barcode I saved to generate the exprs matrix and pData). The new object is 15446 features, 2425 samples ; next I rerun the whole monocle pipeline, and when hitting the reduceDimensions() step, the error is now always: Error in .check_tsne_params(nrow(X), dims = dims, perplexity = perplexity, : perplexity is too large for the number of samples
I changed the perplexity to 1, still show the error..
Any suggestion? or Any one has reached beyond the reduceDimension() on my_cds_subset?
Thanks a lot.
I found there is a person answer :perplexity roughly equal to 5-50.
Dear Dave and all,
the code is old.
And `load_cellranger_matrix()`does strange stuff (like adding some subfolder names and looking there). This was quite annoying. So I went to github site of cellrangerRkit, looked into code, and created a hacked version of `load_cellranger_matrix()`:
“`
load_cellranger_matrix <- function(data_dir,
summary_fn=NULL) {
mat_fn <- file.path(data_dir, "matrix.mtx")
gene_fn <- file.path(data_dir, "genes.tsv")
barcode_fn <- file.path(data_dir, "barcodes.tsv")
mat <- readMM(mat_fn)
gene_info <- read.delim(gene_fn, stringsAsFactors=FALSE, sep="\t", header=FALSE)
barcodes <- read.delim(barcode_fn, stringsAsFactors=FALSE, sep="\t", header=FALSE)
if (dim(mat)[1] != dim(gene_info)[1]) {
stop(sprintf("Mismatch dimension between gene file: \n\t %s\n and matrix file: \n\t %s\n",
gene_fn,
mat_fn))
}
if (dim(mat)[2] != dim(barcodes)[1]) {
stop(sprintf("Mismatch dimension between gene file: \n\t %s\n and matrix file: \n\t %s\n",
barcode_fn,
mat_fn))
}
rownames(mat) <- gene_info[, 1]
colnames(mat) <- barcodes[, 1]
gene_symbols <- gene_info
row.names(gene_symbols) <- gene_symbols[, 1]
colnames(gene_symbols) <- c("id", "symbols")
pd <- data.frame(id=barcodes[,1], row.names=barcodes[,1])
colnames(pd) <- c("barcode")
res <- newGeneBCMatrix(mat=mat, fd=gene_symbols, pd=pd)
res@subsampled <- FALSE
if (is.null(summary_fn)) {
warning("No summary file provided. Some functions may be disabled without the metrics in summary csv.\n")
} else {
summary <- read.csv(summary_fn, as.is=TRUE, header=TRUE)
res@summary <- summary
}
res
}
“`
With this function,
“`
gbm <- load_cellranger_matrix(data_dir)
class(gbm)
“`
works now flawlessly!