Cell Ranger is a set of analysis pipelines that process Chromium single cell 3′ RNA-seq data. The pipelines process raw sequencing output, performs read alignment, generate gene-cell matrices, and can perform downstream analyses such as clustering and gene expression analysis. Cell Ranger includes four pipelines:
- cellranger mkfastq
- cellranger count
- cellranger aggr
- cellranger reanalyze
You can download Cell Ranger from their software download page. Conveniently, Cell Ranger is provided as a single self-contained file that bundles all its own software dependencies. You can view the source at their GitHub repository.
This post was started almost a year ago and it’s still missing the cellranger aggr and cellranger reanalyze steps. I’ve decided to post it anyway, since I haven’t had a post for two months.
First, start downloading the FASTQ files (73.61 GB) that we will use later in the post; they are quite large and depending on your Internet speed, may take up to several hours.
wget -c -N http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_fastqs.tar
Next, visit the download page to generate your own download link for the Cell Ranger tarball.
wget -O cellranger-2.2.0.tar.gz "http://cf.10xgenomics.com/releases/cell-exp/cellranger-2.2.0.tar.gz?Expires=1533651784&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cDovL2NmLjEweGdlbm9taWNzLmNvbS9yZWxlYXNlcy9jZWxsLWV4cC9jZWxscmFuZ2VyLTIuMi4wLnRhci5neiIsIkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTUzMzY1MTc4NH19fV19&Signature=L5k81QJRjubjoBwUeEcLTDT3uuuixkX4ZEJsNqO2ZLfe~cM2ygqHAJBHQ00xJxAdNPOyY8vu0RBaApGZM1H0-OqdqWg6rK4gdO6EF8lJu~oWv97X7zw6EfhVwISUFyoK22WFYR-MAKZHwxiIn8C3vORnZcxxUuN~sq0guVZ5ievI5iFYrHRhumQQhkQr~2fkbQDp3Bd5wtbyflfwMnE7fpdgy~jTSIbMLbHUICnM3-Gc~flroyEcoyugenrcSHNAcO~F0l~gmOewZnqqbEshxHBU4-faJ1iBixhJBlGHVHagqqpq6xyktVdAEIbX6uM~zawmHgi3axDh4HXffwMNiw__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"
Download the hg38 reference tarball.
Download the BCL tarball and sample sheet.
wget http://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz wget http://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-simple-1.2.0.csv
Cell Ranger can perform a test run to verify the installation. We’ll extract the tarball and perform the test run. Follow the official installation instructions for setting up Cell Ranger.
tar -xzf cellranger-2.2.0.tar.gz # test run cellranger testrun --id=testrun \ --localcores=32 \ --localmem=200 # if everything worked you should see the following Pipestance completed successfully! Saving pipestance info to testrun/testrun.mri.tgz
Note that Cell Ranger runs locally and will use 90% of available memory and cores by default. To restrict resource usage, use the command line arguments –localmem and –localcores, as above.
--localcores=NUM Set max cores the pipeline may request at one time. Only applies when --jobmode=local. --localmem=NUM Set max GB the pipeline may request at one time. Only applies when --jobmode=local.
Each core that Cell Ranger uses will spawn 64 user processes, so you may run into problems if your system has a limit on the max user processes. You can use ulimit -u to find out the limit.
ulimit -u 4096
If you have a limit of 4096, set –localcores to 64 or less since 64 * 64 = 4096.
Illumina sequencing instruments generate per-cycle raw base call (BCL) files as primary sequencing output. The cellranger mkfastq pipeline performs the demultiplexing and conversion step for BCL files for each flowcell directory.
This step requires Illumina’s bcl2fastq tool, which needs to be set up prior to running mkfastq. I have compiled v18.104.22.1682 in my bin directory, which has been added to my PATH.
We have already downloaded some sample BCL files in the download section; let’s check it out.
tar -xzf cellranger-tiny-bcl-1.2.0.tar.gz tree --charset=ascii -L 3 cellranger-tiny-bcl-1.2.0 cellranger-tiny-bcl-1.2.0 |-- Config | |-- h35kcbcxy_Oct-20-16\ 15-51-48_Effective.cfg | |-- HiSeqControlSoftware.Options.cfg | |-- RTAStart.bat | `-- Variability_HiSeq_C.bin |-- Data | `-- Intensities | |-- BaseCalls | |-- config.xml | |-- L001 | |-- Offsets | `-- RTAConfiguration.xml |-- InterOp | |-- ControlMetricsOut.bin | |-- CorrectedIntMetricsOut.bin | |-- ErrorMetricsOut.bin | |-- ExtractionMetricsOut.bin | |-- ImageMetricsOut.bin | |-- QMetricsOut.bin | `-- TileMetricsOut.bin |-- RTAComplete.txt |-- RunInfo.xml `-- runParameters.xml 7 directories, 16 files
This is the typical directory structure generated by Illumina systems; the image below is for MiniSeq or NextSeq systems but other systems are quite similar.
Figure from the bcl2fastq2 Conversion Software v2.19 User Guide.
A sample sheet directs the mkfastq step on how to assign reads to samples and samples to projects. Let’s check out the example sample sheet we downloaded.
cat cellranger-tiny-bcl-simple-1.2.0.csv Lane,Sample,Index 1,test_sample,SI-P03-C9
This is the simplest case where we have one lane of sequencing with one sample (test_sample) that has one sample index. Multiplexed sequencing allows multiple samples on one lane by introducing index sequences during sample preparation.
Let’s run the mkfastq step; adjust the location of the directory and files according, since you may have put them in different locations.
cellranger mkfastq --id=tiny-bcl \ --localcores=32 \ --localmem=200 \ --run=cellranger-tiny-bcl-1.2.0 \ --csv=cellranger-tiny-bcl-simple-1.2.0.csv # not all output is shown ... Outputs: - Run QC metrics: null - FASTQ output folder: /home/dtang/fd/test/tiny-bcl/outs/fastq_path - Interop output folder: /home/dtang/fd/test/tiny-bcl/outs/interop_path - Input samplesheet: /home/dtang/fd/test/tiny-bcl/outs/input_samplesheet.csv Waiting 6 seconds for UI to do final refresh. Pipestance completed successfully! Saving pipestance info to tiny-bcl/tiny-bcl.mri.tgz
The tiny-bcl folder contains several directories and files.
tree --charset=ascii -L 2 tiny-bcl/ tiny-bcl/ |-- _cmdline |-- _filelist |-- _finalstate |-- _invocation |-- _jobmode |-- _log |-- MAKE_FASTQS_CS | |-- fork0 | `-- MAKE_FASTQS |-- _mrosource |-- outs | |-- fastq_path | |-- input_samplesheet.csv | `-- interop_path |-- _perf |-- _sitecheck |-- _tags |-- _timestamp |-- tiny-bcl.mri.tgz |-- _uuid |-- _vdrkill `-- _versions 6 directories, 16 files
The FASTQ output can be found in the outs/fastq_path/ directory (default setting).
tree --charset=ascii -L 3 tiny-bcl/outs/fastq_path/ tiny-bcl/outs/fastq_path/ |-- H35KCBCXY | `-- test_sample | |-- test_sample_S1_L001_I1_001.fastq.gz | |-- test_sample_S1_L001_R1_001.fastq.gz | `-- test_sample_S1_L001_R2_001.fastq.gz |-- Reports | `-- html | `-- H35KCBCXY |-- Stats | |-- AdapterTrimming.txt | |-- ConversionStats.xml | |-- DemultiplexingStats.xml | |-- DemuxSummaryF1L1.txt | |-- FastqSummaryF1L1.txt | `-- Stats.json |-- Undetermined_S0_L001_I1_001.fastq.gz |-- Undetermined_S0_L001_R1_001.fastq.gz `-- Undetermined_S0_L001_R2_001.fastq.gz 6 directories, 12 files
The FASTQ files are named according to the sample column of the sample sheet. If a sample ID was not specified, the flow cell ID is used instead (not shown here). In addition to the FASTQ files, bcl2fastq generates various summary files. If –stats-dir was not specified, summary and statistic files will be stored in a Stats folder by default.
cellranger count quantifies single-cell gene expression. For this step, we’ll use the PBMC FASTQ files (pbmc8k_fastqs.tar) we downloaded right at the start of the post. This dataset was generated from Peripheral blood mononuclear cells (PBMCs), which are primary cells with relatively small amounts of RNA (~1pg RNA/cell), from a healthy donor and sequenced on an Illumina HiSeq 4000. The first read pair is 26 bp and contains the cell barcode and the UMI sequence (16bp Chromium barcode and 10bp UMI), the second read pair is 98 bp and contains the cDNA sequence, and there’s the 8 bp I7 sample barcodes.
We’ll extract the reads and check them out!
tar -xf pbmc8k_fastqs.tar ls -1 fastqs/ pbmc8k_S1_L007_I1_001.fastq.gz pbmc8k_S1_L007_R1_001.fastq.gz pbmc8k_S1_L007_R2_001.fastq.gz pbmc8k_S1_L008_I1_001.fastq.gz pbmc8k_S1_L008_R1_001.fastq.gz pbmc8k_S1_L008_R2_001.fastq.gz # read 1 contains the cell barcode and UMI tag # note the sample index in the header zcat fastqs/pbmc8k_S1_L007_R1_001.fastq.gz | head -8 @ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 1:N:0:GTAATTGC GGGCACTAGCTGATAAGGGCCCAACG + A-AFFJA-AAJ<FF-F<<F-7FJJJJ @ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 1:N:0:GTAATTGC CCGTGGACAGTGACAGATAACGCTTT + AAFF-AJ-A-AFJJ-FFFJJJFJJJF # read 2 contains cDNA sequence # note the sample index in the header zcat fastqs/pbmc8k_S1_L007_R2_001.fastq.gz | head -8 @ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 2:N:0:GTAATTGC GNTGTGGCAGAGCAGCGACCCGCGGCGGGGCGGCATCCCCAGCTGGTTCGGGCCGGGACGGGGCGGCCAGCAGGGACGCGCCCCAGGGGGGCAGCTGT + A#-<<F7<AJF-FJ<JAAJJFJJ<AF-7AJF77<FJJJFFFJJ<JA-7-777<-F7<<F--7AA7AAAFF-AF<A-AFFA7J7F--7)-)7--7A<J- @ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 2:N:0:GTAATTGC CNATGAGTGAAAGGGAGCCAGAAGACTGATTGGAGGGCCCTAGCTTGTGAGTGCGGCATCTCTTGGACTTTCCACCTGGTCATATACTCTGCAGCTGT + A#<-<FFAFFJJJ--FAFJJFJJ77FJ<FJFJAJ--<77AJ<<FAA-AFFJ--7FFAFAAF-A-777A7AFJJ7-7-77---<F-<FJJJJAFJJJAA # the index reads contain the sample index zcat fastqs/pbmc8k_S1_L007_I1_001.fastq.gz | head -8 @ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 1:N:0:GTAATTGC GTAATTGC + AAAFFJFJ @ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 1:N:0:GTAATTGC GTAATTGC + AAAFFJ<F
We can now run cellranger count on our FASTQ files. We will use the argument –cells=10000, which is the expected number of recovered cells. The pipeline will create a new directory based on the –id input; if this folder already exists, cellranger will assume it is an existing pipestance and attempt to resume running it. The –sample input needs to set to the name prefixed to our FASTQ files; in this run, I’ve set –id and –sample as pbmc8k. Other input arguments are self-explanatory.
cellranger count --id=pbmc8k \ --transcriptome=refdata-cellranger-GRCh38-1.2.0 \ --fastqs=fastqs/ \ --sample=pbmc8k \ --localcores=32 \ --expect-cells=10000 \ --localmem=200 ... Waiting 6 seconds for UI to do final refresh. Pipestance completed successfully! Saving pipestance info to pbmc8k/pbmc8k.mri.tgz
That took roughly 9.5 hours and the results are in the folder pbmc8k.
tree --charset=ascii -L 2 pbmc8k pbmc8k |-- _cmdline |-- _filelist |-- _finalstate |-- _invocation |-- _jobmode |-- _log |-- _mrosource |-- outs | |-- analysis | |-- cloupe.cloupe | |-- filtered_gene_bc_matrices | |-- filtered_gene_bc_matrices_h5.h5 | |-- metrics_summary.csv | |-- molecule_info.h5 | |-- possorted_genome_bam.bam | |-- possorted_genome_bam.bam.bai | |-- raw_gene_bc_matrices | |-- raw_gene_bc_matrices_h5.h5 | `-- web_summary.html |-- pbmc8k.mri.tgz |-- _perf |-- _perf._truncated_ |-- SC_RNA_COUNTER_CS | |-- CLOUPE_PREPROCESS | |-- fork0 | `-- SC_RNA_COUNTER |-- _sitecheck |-- _tags |-- _timestamp |-- _uuid |-- _vdrkill `-- _versions 8 directories, 24 files
Below is a description of the output directories and files.
|web_summary.html||Run summary metrics and charts in HTML format|
|metrics_summary.csv||Run summary metrics in CSV format|
|possorted_genome_bam.bam||Reads aligned to the genome and transcriptome annotated with barcode information|
|possorted_genome_bam.bam.bai||Index for possorted_genome_bam.bam|
|filtered_gene_bc_matrices||Filtered gene-barcode matrices containing only cellular barcodes in MEX format|
|filtered_gene_bc_matrices_h5.h5||Filtered gene-barcode matrices containing only cellular barcodes in HDF5 format|
|raw_gene_bc_matrices||Unfiltered gene-barcode matrices containing all barcodes in MEX format|
|raw_gene_bc_matrices_h5.h5||Unfiltered gene-barcode matrices containing all barcodes in HDF5 format|
|analysis||Secondary analysis data including dimensionality reduction, cell clustering, and differential expression|
|molecule_info.h5||Molecule-level information used by cellranger aggr to aggregate samples into larger datasets.|
|cloupe.cloupe||Loupe Cell Browser visualization and analysis file|
I’ve written a post describing the BAM file generated by cellranger count. The raw_gene_bc_matrices and filtered_gene_bc_matrices contain the gene expression profiles for all cell barcodes and filtered cell barcodes, respectively.
These matrices are available for download and we can compare our run results with the results available online.
# compare raw matrix wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_raw_gene_bc_matrices.tar.gz tar -xzf pbmc8k_raw_gene_bc_matrices.tar.gz tree --charset=ascii raw_gene_bc_matrices/ raw_gene_bc_matrices/ `-- GRCh38 |-- barcodes.tsv |-- genes.tsv `-- matrix.mtx 1 directory, 3 files # checksum for downloaded results md5sum raw_gene_bc_matrices/GRCh38/* f010059d7160aac2dad1805e1496bf71 raw_gene_bc_matrices/GRCh38/barcodes.tsv 49255c8490781e36052f33ac8fbf6c23 raw_gene_bc_matrices/GRCh38/genes.tsv 8a277e197570347905c0151bd9563405 raw_gene_bc_matrices/GRCh38/matrix.mtx # checksum for our run md5sum outs/raw_gene_bc_matrices/GRCh38/* f010059d7160aac2dad1805e1496bf71 outs/raw_gene_bc_matrices/GRCh38/barcodes.tsv 49255c8490781e36052f33ac8fbf6c23 outs/raw_gene_bc_matrices/GRCh38/genes.tsv 8a277e197570347905c0151bd9563405 outs/raw_gene_bc_matrices/GRCh38/matrix.mtx
Our raw matrix files are exactly the same as the ones online.
wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz tar -xzf pbmc8k_filtered_gene_bc_matrices.tar.gz tree --charset=ascii filtered_gene_bc_matrices/ filtered_gene_bc_matrices/ `-- GRCh38 |-- barcodes.tsv |-- genes.tsv `-- matrix.mtx 1 directory, 3 files # checksum for downloaded results md5sum filtered_gene_bc_matrices/GRCh38/* 1613a13e62858867c6af36206ec72c99 filtered_gene_bc_matrices/GRCh38/barcodes.tsv 49255c8490781e36052f33ac8fbf6c23 filtered_gene_bc_matrices/GRCh38/genes.tsv 27845c1e8de085782b25e13980351e96 filtered_gene_bc_matrices/GRCh38/matrix.mtx # checksum for our run md5sum outs/filtered_gene_bc_matrices/GRCh38/* 1613a13e62858867c6af36206ec72c99 outs/filtered_gene_bc_matrices/GRCh38/barcodes.tsv 49255c8490781e36052f33ac8fbf6c23 outs/filtered_gene_bc_matrices/GRCh38/genes.tsv 27845c1e8de085782b25e13980351e96 outs/filtered_gene_bc_matrices/GRCh38/matrix.mtx
Our filtered matrix files are exactly the same as well.
The difference between the raw and the filtered gene matrices is that the raw matrices contain all cell barcodes and the filtered dataset only contains cell barcodes that Cell Ranger considers to be real, i.e. droplets that contain a cell. The total UMI of a cell barcode is used to rank the barcodes and this is plotted on a log scale (from highest to lowest) to determine the UMI threshold for filtering.
The plot above is generated by the cellranger count step; it can be view here. The assumption is that cell barcodes with high UMI counts contain real cells and those with low UMI counts don’t. The drop-off point in the plot is used to determine the threshold.
An alternative method for determining whether droplets contain a cell or just ambient RNA (which some refer to as “soup”) is implemented in the DropletUtils package. Briefly, the method takes all cell barcodes with a UMI count less than 100 (default) and generates a proportion value for each gene. Next, all cell barcodes with a UMI count higher than 100 are tested against this proportion vector; cell barcodes with a significant deviation from the ambient profile are considered droplets with real cells.
To get started, install the package.
The package uses the read10xCounts function to load 10x data and stores it as a SingleCellExperiment object. The argument is the folder containing the barcodes.tsv, genes.tsv, and matrix.mtx files.
library("DropletUtils") sce <- read10xCounts("raw_gene_bc_matrices/GRCh38/") sce class: SingleCellExperiment dim: 33694 737280 metadata(0): assays(1): counts rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ENSG00000268674 rowData names(2): ID Symbol colnames: NULL colData names(2): Sample Barcode reducedDimNames(0): spikeNames(0):
The barcodeRanks function is then used to rank the cell barcodes by its total UMI count; the function also calculates the drop-off points.
br.out <- barcodeRanks(counts(sce)) library("dplyr") # I'll convert the list into a data frame # and add the cell barcodes to the data frame br.out.df <- as.data.frame(br.out) br.out.df$barcode <- colData(sce)$Barcode br.out.df %>% filter(rank <= 10) %>% arrange(rank) rank total fitted knee inflection barcode 1 1 31785 NA 3091 605 GCAGTTAGTTCCATGA-1 2 2 26321 NA 3091 605 CCTTCCCTCATCTGTT-1 3 3 25925 NA 3091 605 AGATCTGGTGTGCGTC-1 4 4 24530 NA 3091 605 ACCCACTAGATCGATA-1 5 5 22883 NA 3091 605 GTTAAGCGTACCTACA-1 6 6 21744 NA 3091 605 GGCTCGAAGAGGTAGA-1 7 7 21446 NA 3091 605 GATCGCGTCCTGCTTG-1 8 8 21086 NA 3091 605 TGCGGGTGTACCAGTT-1 9 9 21005 NA 3091 605 AGGTCCGAGCTAACTC-1 10 10 20279 NA 3091 605 GTGTGCGCATGCCACG-1
We can produce a barcode rank vs. total UMI plot similar to the one in the web summary report.
library(ggplot2) x_knee <- br.out.df %>% filter(total > br.out$knee) %>% arrange(total) %>% select(rank) %>% head(1) x_inflection <- br.out.df %>% filter(total > br.out$inflection) %>% arrange(total) %>% select(rank) %>% head(1) padding <- length(br.out$rank) / 10 ggplot(br.out.df, aes(x = rank, y = total)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw() + theme(axis.text = element_text(size = 10), axis.title = element_text(size = 14), title = element_text(size = 16)) + geom_hline(yintercept = br.out$knee, linetype = 2, colour = "dodgerblue") + geom_hline(yintercept = br.out$inflection, linetype = 2, colour = "forestgreen") + labs(x = "Rank", y = "Total", title = "Barcode Rank vs Total UMI") + annotate("text", label = paste0("Knee (", x_knee, ")"), x = x_knee$rank + padding, y = br.out$knee, size = 5) + annotate("text", label = paste0("Inflection (", x_inflection, ")"), x = x_inflection$rank + padding, y = br.out$inflection, size = 5)
Now to test whether cell barcodes with >100 total UMIs have the same gene profile as “ambient” cell barcodes.
set.seed(100) e.out <- emptyDrops(counts(sce)) # use FDR threshold of 0.01 is.cell <- e.out$FDR <= 0.01 # replace all NA's with FALSE is.cell.no.na <- is.cell is.cell.no.na[is.na(is.cell)] <- FALSE sum(is.cell.no.na)  8482 ggplot(br.out.df[is.cell.no.na,], aes(x = rank, y = total)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw() + theme(axis.text = element_text(size = 10), axis.title = element_text(size = 14), title = element_text(size = 16)) + labs(x = "Rank", y = "Total", title = "Cell barcodes that differ from ambient RNA")
We can compare this to what Cell Ranger considers real cell droplets by loading the filtered results; you need the readr and gplots packages.
my_filtered_bc <- readr::read_tsv("filtered_gene_bc_matrices/GRCh38/barcodes.tsv", col_names = "barcode") gplots::venn(list(DropletUtils = br.out.df$barcode[is.cell.no.na], "Cell Ranger" = my_filtered_bc$barcode))
Creating your own reference
The human reference (GRCh38) dataset provided by 10x Genomics contains the FASTA sequence of hg38, a GTF file of genes, a pickle file (not sure what that is), and index files for STAR.
tree --charset=ascii refdata-cellranger-GRCh38-1.2.0 refdata-cellranger-GRCh38-1.2.0 |-- fasta | `-- genome.fa |-- genes | `-- genes.gtf |-- pickle | `-- genes.pickle |-- README.BEFORE.MODIFYING |-- reference.json |-- star | |-- chrLength.txt | |-- chrNameLength.txt | |-- chrName.txt | |-- chrStart.txt | |-- exonGeTrInfo.tab | |-- exonInfo.tab | |-- geneInfo.tab | |-- Genome | |-- genomeParameters.txt | |-- SA | |-- SAindex | |-- sjdbInfo.txt | |-- sjdbList.fromGTF.out.tab | |-- sjdbList.out.tab | `-- transcriptInfo.tab `-- version 4 directories, 21 files
However, if you are not working with human or mouse data, you will need to create your own reference files. Or you may be working with nuclei data, which contains a nascent mRNAs that are unspliced and contain introns; cellranger count will only count reads mapping to exons. Cell Ranger provides two utilities for creating your own references: cellranger mkgtf and cellranger mkref. You require a reference genome sequence (FASTA) and gene annotations (GTF). The guide is straightforward enough to follow though you will need to decide which biotypes to include in your GTF file.
Cell Ranger is the software provided by 10x Genomics to process Chromium single cell 3′ RNA-seq data. There are four pipelines included in the software package but this post only described the first two. One issue with droplet based scRNA-seq methods is whether a droplet contains a cell (or even multiple cells, which should be another post) or just ambient RNA. DropletUtils is a package that can be used to determine whether droplets contained a cell.
This work is licensed under a Creative Commons
Attribution 4.0 International License.