Getting started with Cell Ranger

Cell Ranger is a set of analysis pipelines that process Chromium single cell 3′ RNA-seq data. The pipelines process raw sequencing output, performs read alignment, generate gene-cell matrices, and can perform downstream analyses such as clustering and gene expression analysis. Cell Ranger includes four pipelines:

  • cellranger mkfastq
  • cellranger count
  • cellranger aggr
  • cellranger reanalyze

You can download Cell Ranger from their software download page. Conveniently, Cell Ranger is provided as a single self-contained file that bundles all its own software dependencies. You can view the source at their GitHub repository.

This post was started almost a year ago and it’s still missing the cellranger aggr and cellranger reanalyze steps. I’ve decided to post it anyway, since I haven’t had a post for two months.

Downloading data

First, start downloading the FASTQ files (73.61 GB) that we will use later in the post; they are quite large and depending on your Internet speed, may take up to several hours.

wget -c -N

Next, visit the download page to generate your own download link for the Cell Ranger tarball.

wget -O cellranger-2.2.0.tar.gz ""

Download the hg38 reference tarball.


Download the BCL tarball and sample sheet.


Test run

Cell Ranger can perform a test run to verify the installation. We’ll extract the tarball and perform the test run. Follow the official installation instructions for setting up Cell Ranger.

tar -xzf cellranger-2.2.0.tar.gz

# test run
cellranger testrun --id=testrun \
                   --localcores=32 \

# if everything worked you should see the following
Pipestance completed successfully!

Saving pipestance info to testrun/testrun.mri.tgz

Note that Cell Ranger runs locally and will use 90% of available memory and cores by default. To restrict resource usage, use the command line arguments –localmem and –localcores, as above.

--localcores=NUM    Set max cores the pipeline may request at one time.
                            Only applies when --jobmode=local.
--localmem=NUM      Set max GB the pipeline may request at one time.
                            Only applies when --jobmode=local.

Each core that Cell Ranger uses will spawn 64 user processes, so you may run into problems if your system has a limit on the max user processes. You can use ulimit -u to find out the limit.

ulimit -u

If you have a limit of 4096, set –localcores to 64 or less since 64 * 64 = 4096.


Illumina sequencing instruments generate per-cycle raw base call (BCL) files as primary sequencing output. The cellranger mkfastq pipeline performs the demultiplexing and conversion step for BCL files for each flowcell directory.

This step requires Illumina’s bcl2fastq tool, which needs to be set up prior to running mkfastq. I have compiled v2.20.0.422 in my bin directory, which has been added to my PATH.

We have already downloaded some sample BCL files in the download section; let’s check it out.

tar -xzf cellranger-tiny-bcl-1.2.0.tar.gz

tree --charset=ascii -L 3 cellranger-tiny-bcl-1.2.0 
|-- Config
|   |-- h35kcbcxy_Oct-20-16\ 15-51-48_Effective.cfg
|   |-- HiSeqControlSoftware.Options.cfg
|   |-- RTAStart.bat
|   `-- Variability_HiSeq_C.bin
|-- Data
|   `-- Intensities
|       |-- BaseCalls
|       |-- config.xml
|       |-- L001
|       |-- Offsets
|       `-- RTAConfiguration.xml
|-- InterOp
|   |-- ControlMetricsOut.bin
|   |-- CorrectedIntMetricsOut.bin
|   |-- ErrorMetricsOut.bin
|   |-- ExtractionMetricsOut.bin
|   |-- ImageMetricsOut.bin
|   |-- QMetricsOut.bin
|   `-- TileMetricsOut.bin
|-- RTAComplete.txt
|-- RunInfo.xml
`-- runParameters.xml

7 directories, 16 files

This is the typical directory structure generated by Illumina systems; the image below is for MiniSeq or NextSeq systems but other systems are quite similar.

Figure from the bcl2fastq2 Conversion Software v2.19 User Guide.

A sample sheet directs the mkfastq step on how to assign reads to samples and samples to projects. Let’s check out the example sample sheet we downloaded.

cat cellranger-tiny-bcl-simple-1.2.0.csv

This is the simplest case where we have one lane of sequencing with one sample (test_sample) that has one sample index. Multiplexed sequencing allows multiple samples on one lane by introducing index sequences during sample preparation.

Let’s run the mkfastq step; adjust the location of the directory and files according, since you may have put them in different locations.

cellranger mkfastq --id=tiny-bcl \
                   --localcores=32 \
                   --localmem=200 \
                   --run=cellranger-tiny-bcl-1.2.0 \

# not all output is shown

- Run QC metrics:        null
- FASTQ output folder:   /home/dtang/fd/test/tiny-bcl/outs/fastq_path
- Interop output folder: /home/dtang/fd/test/tiny-bcl/outs/interop_path
- Input samplesheet:     /home/dtang/fd/test/tiny-bcl/outs/input_samplesheet.csv

Waiting 6 seconds for UI to do final refresh.
Pipestance completed successfully!

Saving pipestance info to tiny-bcl/tiny-bcl.mri.tgz

The tiny-bcl folder contains several directories and files.

tree --charset=ascii -L 2 tiny-bcl/
|-- _cmdline
|-- _filelist
|-- _finalstate
|-- _invocation
|-- _jobmode
|-- _log
|   |-- fork0
|-- _mrosource
|-- outs
|   |-- fastq_path
|   |-- input_samplesheet.csv
|   `-- interop_path
|-- _perf
|-- _sitecheck
|-- _tags
|-- _timestamp
|-- tiny-bcl.mri.tgz
|-- _uuid
|-- _vdrkill
`-- _versions

6 directories, 16 files

The FASTQ output can be found in the outs/fastq_path/ directory (default setting).

tree --charset=ascii -L 3 tiny-bcl/outs/fastq_path/
|   `-- test_sample
|       |-- test_sample_S1_L001_I1_001.fastq.gz
|       |-- test_sample_S1_L001_R1_001.fastq.gz
|       `-- test_sample_S1_L001_R2_001.fastq.gz
|-- Reports
|   `-- html
|       `-- H35KCBCXY
|-- Stats
|   |-- AdapterTrimming.txt
|   |-- ConversionStats.xml
|   |-- DemultiplexingStats.xml
|   |-- DemuxSummaryF1L1.txt
|   |-- FastqSummaryF1L1.txt
|   `-- Stats.json
|-- Undetermined_S0_L001_I1_001.fastq.gz
|-- Undetermined_S0_L001_R1_001.fastq.gz
`-- Undetermined_S0_L001_R2_001.fastq.gz

6 directories, 12 files

The FASTQ files are named according to the sample column of the sample sheet. If a sample ID was not specified, the flow cell ID is used instead (not shown here). In addition to the FASTQ files, bcl2fastq generates various summary files. If –stats-dir was not specified, summary and statistic files will be stored in a Stats folder by default.


cellranger count quantifies single-cell gene expression. For this step, we’ll use the PBMC FASTQ files (pbmc8k_fastqs.tar) we downloaded right at the start of the post. This dataset was generated from Peripheral blood mononuclear cells (PBMCs), which are primary cells with relatively small amounts of RNA (~1pg RNA/cell), from a healthy donor and sequenced on an Illumina HiSeq 4000. The first read pair is 26 bp and contains the cell barcode and the UMI sequence (16bp Chromium barcode and 10bp UMI), the second read pair is 98 bp and contains the cDNA sequence, and there’s the 8 bp I7 sample barcodes.

We’ll extract the reads and check them out!

tar -xf pbmc8k_fastqs.tar

ls -1 fastqs/

# read 1 contains the cell barcode and UMI tag
# note the sample index in the header
zcat fastqs/pbmc8k_S1_L007_R1_001.fastq.gz | head -8
@ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 1:N:0:GTAATTGC
@ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 1:N:0:GTAATTGC

# read 2 contains cDNA sequence
# note the sample index in the header
zcat fastqs/pbmc8k_S1_L007_R2_001.fastq.gz | head -8
@ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 2:N:0:GTAATTGC
@ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 2:N:0:GTAATTGC

# the index reads contain the sample index
zcat fastqs/pbmc8k_S1_L007_I1_001.fastq.gz | head -8
@ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 1:N:0:GTAATTGC
@ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 1:N:0:GTAATTGC

We can now run cellranger count on our FASTQ files. We will use the argument –cells=10000, which is the expected number of recovered cells. The pipeline will create a new directory based on the –id input; if this folder already exists, cellranger will assume it is an existing pipestance and attempt to resume running it. The –sample input needs to set to the name prefixed to our FASTQ files; in this run, I’ve set –id and –sample as pbmc8k. Other input arguments are self-explanatory.

cellranger count --id=pbmc8k \
                 --transcriptome=refdata-cellranger-GRCh38-1.2.0 \
                 --fastqs=fastqs/ \
                 --sample=pbmc8k \
                 --localcores=32 \
                 --expect-cells=10000 \

Waiting 6 seconds for UI to do final refresh.
Pipestance completed successfully!

Saving pipestance info to pbmc8k/pbmc8k.mri.tgz

That took roughly 9.5 hours and the results are in the folder pbmc8k.

tree --charset=ascii -L 2 pbmc8k
|-- _cmdline
|-- _filelist
|-- _finalstate
|-- _invocation
|-- _jobmode
|-- _log
|-- _mrosource
|-- outs
|   |-- analysis
|   |-- cloupe.cloupe
|   |-- filtered_gene_bc_matrices
|   |-- filtered_gene_bc_matrices_h5.h5
|   |-- metrics_summary.csv
|   |-- molecule_info.h5
|   |-- possorted_genome_bam.bam
|   |-- possorted_genome_bam.bam.bai
|   |-- raw_gene_bc_matrices
|   |-- raw_gene_bc_matrices_h5.h5
|   `-- web_summary.html
|-- pbmc8k.mri.tgz
|-- _perf
|-- _perf._truncated_
|   |-- fork0
|-- _sitecheck
|-- _tags
|-- _timestamp
|-- _uuid
|-- _vdrkill
`-- _versions

8 directories, 24 files

Below is a description of the output directories and files.

File Name Description
web_summary.html Run summary metrics and charts in HTML format
metrics_summary.csv Run summary metrics in CSV format
possorted_genome_bam.bam Reads aligned to the genome and transcriptome annotated with barcode information
possorted_genome_bam.bam.bai Index for possorted_genome_bam.bam
filtered_gene_bc_matrices Filtered gene-barcode matrices containing only cellular barcodes in MEX format
filtered_gene_bc_matrices_h5.h5 Filtered gene-barcode matrices containing only cellular barcodes in HDF5 format
raw_gene_bc_matrices Unfiltered gene-barcode matrices containing all barcodes in MEX format
raw_gene_bc_matrices_h5.h5 Unfiltered gene-barcode matrices containing all barcodes in HDF5 format
analysis Secondary analysis data including dimensionality reduction, cell clustering, and differential expression
molecule_info.h5 Molecule-level information used by cellranger aggr to aggregate samples into larger datasets.
cloupe.cloupe Loupe Cell Browser visualization and analysis file

I’ve written a post describing the BAM file generated by cellranger count. The raw_gene_bc_matrices and filtered_gene_bc_matrices contain the gene expression profiles for all cell barcodes and filtered cell barcodes, respectively.

These matrices are available for download and we can compare our run results with the results available online.

# compare raw matrix
tar -xzf pbmc8k_raw_gene_bc_matrices.tar.gz

tree --charset=ascii raw_gene_bc_matrices/
`-- GRCh38
    |-- barcodes.tsv
    |-- genes.tsv
    `-- matrix.mtx

1 directory, 3 files

# checksum for downloaded results
md5sum raw_gene_bc_matrices/GRCh38/*
f010059d7160aac2dad1805e1496bf71  raw_gene_bc_matrices/GRCh38/barcodes.tsv
49255c8490781e36052f33ac8fbf6c23  raw_gene_bc_matrices/GRCh38/genes.tsv
8a277e197570347905c0151bd9563405  raw_gene_bc_matrices/GRCh38/matrix.mtx

# checksum for our run
md5sum outs/raw_gene_bc_matrices/GRCh38/*
f010059d7160aac2dad1805e1496bf71  outs/raw_gene_bc_matrices/GRCh38/barcodes.tsv
49255c8490781e36052f33ac8fbf6c23  outs/raw_gene_bc_matrices/GRCh38/genes.tsv
8a277e197570347905c0151bd9563405  outs/raw_gene_bc_matrices/GRCh38/matrix.mtx

Our raw matrix files are exactly the same as the ones online.

tar -xzf pbmc8k_filtered_gene_bc_matrices.tar.gz

tree --charset=ascii filtered_gene_bc_matrices/
`-- GRCh38
    |-- barcodes.tsv
    |-- genes.tsv
    `-- matrix.mtx

1 directory, 3 files

# checksum for downloaded results
md5sum filtered_gene_bc_matrices/GRCh38/*
1613a13e62858867c6af36206ec72c99  filtered_gene_bc_matrices/GRCh38/barcodes.tsv
49255c8490781e36052f33ac8fbf6c23  filtered_gene_bc_matrices/GRCh38/genes.tsv
27845c1e8de085782b25e13980351e96  filtered_gene_bc_matrices/GRCh38/matrix.mtx

# checksum for our run
md5sum outs/filtered_gene_bc_matrices/GRCh38/*
1613a13e62858867c6af36206ec72c99  outs/filtered_gene_bc_matrices/GRCh38/barcodes.tsv
49255c8490781e36052f33ac8fbf6c23  outs/filtered_gene_bc_matrices/GRCh38/genes.tsv
27845c1e8de085782b25e13980351e96  outs/filtered_gene_bc_matrices/GRCh38/matrix.mtx

Our filtered matrix files are exactly the same as well.


The difference between the raw and the filtered gene matrices is that the raw matrices contain all cell barcodes and the filtered dataset only contains cell barcodes that Cell Ranger considers to be real, i.e. droplets that contain a cell. The total UMI of a cell barcode is used to rank the barcodes and this is plotted on a log scale (from highest to lowest) to determine the UMI threshold for filtering.

The plot above is generated by the cellranger count step; it can be view here. The assumption is that cell barcodes with high UMI counts contain real cells and those with low UMI counts don’t. The drop-off point in the plot is used to determine the threshold.

An alternative method for determining whether droplets contain a cell or just ambient RNA (which some refer to as “soup”) is implemented in the DropletUtils package. Briefly, the method takes all cell barcodes with a UMI count less than 100 (default) and generates a proportion value for each gene. Next, all cell barcodes with a UMI count higher than 100 are tested against this proportion vector; cell barcodes with a significant deviation from the ambient profile are considered droplets with real cells.

To get started, install the package.


The package uses the read10xCounts function to load 10x data and stores it as a SingleCellExperiment object. The argument is the folder containing the barcodes.tsv, genes.tsv, and matrix.mtx files.


sce <- read10xCounts("raw_gene_bc_matrices/GRCh38/")

class: SingleCellExperiment 
dim: 33694 737280 
assays(1): counts
rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ENSG00000268674
rowData names(2): ID Symbol
colnames: NULL
colData names(2): Sample Barcode

The barcodeRanks function is then used to rank the cell barcodes by its total UMI count; the function also calculates the drop-off points.

br.out <- barcodeRanks(counts(sce))

# I'll convert the list into a data frame
# and add the cell barcodes to the data frame
br.out.df <-
br.out.df$barcode <- colData(sce)$Barcode
br.out.df %>%
  filter(rank <= 10) %>%

   rank total fitted knee inflection            barcode
1     1 31785     NA 3091        605 GCAGTTAGTTCCATGA-1
2     2 26321     NA 3091        605 CCTTCCCTCATCTGTT-1
3     3 25925     NA 3091        605 AGATCTGGTGTGCGTC-1
4     4 24530     NA 3091        605 ACCCACTAGATCGATA-1
5     5 22883     NA 3091        605 GTTAAGCGTACCTACA-1
6     6 21744     NA 3091        605 GGCTCGAAGAGGTAGA-1
7     7 21446     NA 3091        605 GATCGCGTCCTGCTTG-1
8     8 21086     NA 3091        605 TGCGGGTGTACCAGTT-1
9     9 21005     NA 3091        605 AGGTCCGAGCTAACTC-1
10   10 20279     NA 3091        605 GTGTGCGCATGCCACG-1

We can produce a barcode rank vs. total UMI plot similar to the one in the web summary report.

x_knee <- br.out.df %>% filter(total > br.out$knee) %>% arrange(total) %>% select(rank) %>% head(1)
x_inflection <- br.out.df %>% filter(total > br.out$inflection) %>% arrange(total) %>% select(rank) %>% head(1)
padding <- length(br.out$rank) / 10

ggplot(br.out.df, aes(x = rank, y = total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 14),
        title = element_text(size = 16)) +
  geom_hline(yintercept = br.out$knee, linetype = 2, colour = "dodgerblue") +
  geom_hline(yintercept = br.out$inflection, linetype = 2, colour = "forestgreen") +
  labs(x = "Rank", y = "Total", title = "Barcode Rank vs Total UMI") +
  annotate("text", label = paste0("Knee (", x_knee, ")"), x = x_knee$rank + padding, y = br.out$knee, size = 5) +
  annotate("text", label = paste0("Inflection (", x_inflection, ")"), x = x_inflection$rank + padding, y = br.out$inflection, size = 5)

If we filter at a total UMI count of 3091, we get 6,705 cell barcodes; at a total UMI of 605, we get 8,719 cell barcodes.

Now to test whether cell barcodes with >100 total UMIs have the same gene profile as “ambient” cell barcodes.

e.out <- emptyDrops(counts(sce))

# use FDR threshold of 0.01
is.cell <- e.out$FDR <= 0.01

# replace all NA's with FALSE <- is.cell[] <- FALSE
[1] 8482

ggplot(br.out.df[,], aes(x = rank, y = total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 14),
        title = element_text(size = 16)) +
  labs(x = "Rank", y = "Total", title = "Cell barcodes that differ from ambient RNA")

Cell barcodes that likely contain a real cell. Note that some low UMI cell barcodes are considered to contain cells.

We can compare this to what Cell Ranger considers real cell droplets by loading the filtered results; you need the readr and gplots packages.

my_filtered_bc <- readr::read_tsv("filtered_gene_bc_matrices/GRCh38/barcodes.tsv", col_names = "barcode")
gplots::venn(list(DropletUtils = br.out.df$barcode[], "Cell Ranger" = my_filtered_bc$barcode))

There’s a proportion of cell barcodes with low UMIs that differ enough from the ambient profile that DropletUtils considers to contain a cell.

Creating your own reference

The human reference (GRCh38) dataset provided by 10x Genomics contains the FASTA sequence of hg38, a GTF file of genes, a pickle file (not sure what that is), and index files for STAR.

tree --charset=ascii refdata-cellranger-GRCh38-1.2.0
|-- fasta
|   `-- genome.fa
|-- genes
|   `-- genes.gtf
|-- pickle
|   `-- genes.pickle
|-- reference.json
|-- star
|   |-- chrLength.txt
|   |-- chrNameLength.txt
|   |-- chrName.txt
|   |-- chrStart.txt
|   |--
|   |--
|   |--
|   |-- Genome
|   |-- genomeParameters.txt
|   |-- SA
|   |-- SAindex
|   |-- sjdbInfo.txt
|   |--
|   |--
|   `--
`-- version

4 directories, 21 files

However, if you are not working with human or mouse data, you will need to create your own reference files. Or you may be working with nuclei data, which contains a nascent mRNAs that are unspliced and contain introns; cellranger count will only count reads mapping to exons. Cell Ranger provides two utilities for creating your own references: cellranger mkgtf and cellranger mkref. You require a reference genome sequence (FASTA) and gene annotations (GTF). The guide is straightforward enough to follow though you will need to decide which biotypes to include in your GTF file.


Cell Ranger is the software provided by 10x Genomics to process Chromium single cell 3′ RNA-seq data. There are four pipelines included in the software package but this post only described the first two. One issue with droplet based scRNA-seq methods is whether a droplet contains a cell (or even multiple cells, which should be another post) or just ambient RNA. DropletUtils is a package that can be used to determine whether droplets contained a cell.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
7 comments Add yours
  1. Thank you for the wonderful tutorial. I wonder what “S1_L001_I1_001” is the fastq file represents? Thanks

  2. Hi,
    Love your blog, it is very helpful. I was wondering where to find the information about what R1, R2 and I1 files are after making FASTQs.
    Thank you,

  3. Hey can you tell what will be the ‘–sample’ if the fastqs are not generated by cellranger and do not follow the naming convention of cellranger?

  4. Hi Dave, this is very helpful thank you. Can you give me an idea of how long this whole analysis will take? I am just asking because I need to reserve some time on a shared computer to run this pipeline and wanted to know how much time I need to put aside. Thanks!

    1. Hi Esther,

      the count step takes the longest because of read mapping. Factors that affect the amount of computing time you need to reserve:

      * How many reads are you trying to process? And what reference genome are you mapping to?
      * How many CPUs (cores) do you have?
      * Do you have access to “scratch” space, which provides faster data access?

      I can’t estimate how long but it could take up to 24 hours; try it on your own dataset and see how long it takes.


  5. Any idea how to make it run flawlessly using slurum?

    I tried one sample (deeply sequenced) with 20 cores and 690GB RAM ,but its shutting down after 1.5 h with following error

    2020-02-05 14:27:06 [jobmngr] Highest memory usage observed: {
    “rss”: 14946304,
    “shared”: 3981312,
    “vmem”: 339881984,
    “text”: 4599808,
    “stack”: 319549440,
    “proc_count”: 51

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.