Getting started with Cell Ranger

Cell Ranger is a set of analysis pipelines that process Chromium single cell 3' RNA-seq data. The pipelines process raw sequencing output, performs read alignment, generate gene-cell matrices, and can perform downstream analyses such as clustering and gene expression analysis. Cell Ranger includes four pipelines:

  • cellranger mkfastq
  • cellranger count
  • cellranger aggr
  • cellranger reanalyze

You can download Cell Ranger from their software download page. Conveniently, Cell Ranger is provided as a single self-contained file that bundles all its own software dependencies. You can view the source at their GitHub repository.

This post was started almost a year ago and it's still missing the cellranger aggr and cellranger reanalyze steps. I've decided to post it anyway, since I haven't had a post for two months.

Downloading data

First, start downloading the FASTQ files (73.61 GB) that we will use later in the post; they are quite large and depending on your Internet speed, may take up to several hours.

wget -c -N http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_fastqs.tar

Next, visit the download page to generate your own download link for the Cell Ranger tarball.

wget -O cellranger-2.2.0.tar.gz "http://cf.10xgenomics.com/releases/cell-exp/cellranger-2.2.0.tar.gz?Expires=1533651784&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cDovL2NmLjEweGdlbm9taWNzLmNvbS9yZWxlYXNlcy9jZWxsLWV4cC9jZWxscmFuZ2VyLTIuMi4wLnRhci5neiIsIkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTUzMzY1MTc4NH19fV19&Signature=L5k81QJRjubjoBwUeEcLTDT3uuuixkX4ZEJsNqO2ZLfe~cM2ygqHAJBHQ00xJxAdNPOyY8vu0RBaApGZM1H0-OqdqWg6rK4gdO6EF8lJu~oWv97X7zw6EfhVwISUFyoK22WFYR-MAKZHwxiIn8C3vORnZcxxUuN~sq0guVZ5ievI5iFYrHRhumQQhkQr~2fkbQDp3Bd5wtbyflfwMnE7fpdgy~jTSIbMLbHUICnM3-Gc~flroyEcoyugenrcSHNAcO~F0l~gmOewZnqqbEshxHBU4-faJ1iBixhJBlGHVHagqqpq6xyktVdAEIbX6uM~zawmHgi3axDh4HXffwMNiw__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"

Download the hg38 reference tarball.

wget http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-GRCh38-1.2.0.tar.gz

Download the BCL tarball and sample sheet.

wget http://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz
wget http://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-simple-1.2.0.csv

Test run

Cell Ranger can perform a test run to verify the installation. We'll extract the tarball and perform the test run. Follow the official installation instructions for setting up Cell Ranger.

tar -xzf cellranger-2.2.0.tar.gz

# test run
cellranger testrun --id=testrun \
                   --localcores=32 \
                   --localmem=200

# if everything worked you should see the following
Pipestance completed successfully!

Saving pipestance info to testrun/testrun.mri.tgz

Note that Cell Ranger runs locally and will use 90% of available memory and cores by default. To restrict resource usage, use the command line arguments --localmem and --localcores, as above.

--localcores=NUM    Set max cores the pipeline may request at one time.
                            Only applies when --jobmode=local.
--localmem=NUM      Set max GB the pipeline may request at one time.
                            Only applies when --jobmode=local.

Each core that Cell Ranger uses will spawn 64 user processes, so you may run into problems if your system has a limit on the max user processes. You can use ulimit -u to find out the limit.

ulimit -u
4096

If you have a limit of 4096, set --localcores to 64 or less since 64 * 64 = 4096.

mkfastq

Illumina sequencing instruments generate per-cycle raw base call (BCL) files as primary sequencing output. The cellranger mkfastq pipeline performs the demultiplexing and conversion step for BCL files for each flowcell directory.

This step requires Illumina's bcl2fastq tool, which needs to be set up prior to running mkfastq. I have compiled v2.20.0.422 in my bin directory, which has been added to my PATH.

We have already downloaded some sample BCL files in the download section; let's check it out.

tar -xzf cellranger-tiny-bcl-1.2.0.tar.gz

tree --charset=ascii -L 3 cellranger-tiny-bcl-1.2.0 
cellranger-tiny-bcl-1.2.0
|-- Config
|   |-- h35kcbcxy_Oct-20-16\ 15-51-48_Effective.cfg
|   |-- HiSeqControlSoftware.Options.cfg
|   |-- RTAStart.bat
|   `-- Variability_HiSeq_C.bin
|-- Data
|   `-- Intensities
|       |-- BaseCalls
|       |-- config.xml
|       |-- L001
|       |-- Offsets
|       `-- RTAConfiguration.xml
|-- InterOp
|   |-- ControlMetricsOut.bin
|   |-- CorrectedIntMetricsOut.bin
|   |-- ErrorMetricsOut.bin
|   |-- ExtractionMetricsOut.bin
|   |-- ImageMetricsOut.bin
|   |-- QMetricsOut.bin
|   `-- TileMetricsOut.bin
|-- RTAComplete.txt
|-- RunInfo.xml
`-- runParameters.xml

7 directories, 16 files

This is the typical directory structure generated by Illumina systems; the image below is for MiniSeq or NextSeq systems but other systems are quite similar.

Figure from the bcl2fastq2 Conversion Software v2.19 User Guide.

A sample sheet directs the mkfastq step on how to assign reads to samples and samples to projects. Let's check out the example sample sheet we downloaded.

cat cellranger-tiny-bcl-simple-1.2.0.csv
Lane,Sample,Index
1,test_sample,SI-P03-C9

This is the simplest case where we have one lane of sequencing with one sample (test_sample) that has one sample index. Multiplexed sequencing allows multiple samples on one lane by introducing index sequences during sample preparation.

Let's run the mkfastq step; adjust the location of the directory and files according, since you may have put them in different locations.

cellranger mkfastq --id=tiny-bcl \
                   --localcores=32 \
                   --localmem=200 \
                   --run=cellranger-tiny-bcl-1.2.0 \
                   --csv=cellranger-tiny-bcl-simple-1.2.0.csv

# not all output is shown
...

Outputs:
- Run QC metrics:        null
- FASTQ output folder:   /home/dtang/fd/test/tiny-bcl/outs/fastq_path
- Interop output folder: /home/dtang/fd/test/tiny-bcl/outs/interop_path
- Input samplesheet:     /home/dtang/fd/test/tiny-bcl/outs/input_samplesheet.csv

Waiting 6 seconds for UI to do final refresh.
Pipestance completed successfully!

Saving pipestance info to tiny-bcl/tiny-bcl.mri.tgz

The tiny-bcl folder contains several directories and files.

tree --charset=ascii -L 2 tiny-bcl/
tiny-bcl/
|-- _cmdline
|-- _filelist
|-- _finalstate
|-- _invocation
|-- _jobmode
|-- _log
|-- MAKE_FASTQS_CS
|   |-- fork0
|   `-- MAKE_FASTQS
|-- _mrosource
|-- outs
|   |-- fastq_path
|   |-- input_samplesheet.csv
|   `-- interop_path
|-- _perf
|-- _sitecheck
|-- _tags
|-- _timestamp
|-- tiny-bcl.mri.tgz
|-- _uuid
|-- _vdrkill
`-- _versions

6 directories, 16 files

The FASTQ output can be found in the outs/fastq_path/ directory (default setting).

tree --charset=ascii -L 3 tiny-bcl/outs/fastq_path/
tiny-bcl/outs/fastq_path/
|-- H35KCBCXY
|   `-- test_sample
|       |-- test_sample_S1_L001_I1_001.fastq.gz
|       |-- test_sample_S1_L001_R1_001.fastq.gz
|       `-- test_sample_S1_L001_R2_001.fastq.gz
|-- Reports
|   `-- html
|       `-- H35KCBCXY
|-- Stats
|   |-- AdapterTrimming.txt
|   |-- ConversionStats.xml
|   |-- DemultiplexingStats.xml
|   |-- DemuxSummaryF1L1.txt
|   |-- FastqSummaryF1L1.txt
|   `-- Stats.json
|-- Undetermined_S0_L001_I1_001.fastq.gz
|-- Undetermined_S0_L001_R1_001.fastq.gz
`-- Undetermined_S0_L001_R2_001.fastq.gz

6 directories, 12 files

The FASTQ files are named according to the sample column of the sample sheet. If a sample ID was not specified, the flow cell ID is used instead (not shown here). In addition to the FASTQ files, bcl2fastq generates various summary files. If --stats-dir was not specified, summary and statistic files will be stored in a Stats folder by default.

count

cellranger count quantifies single-cell gene expression. For this step, we'll use the PBMC FASTQ files (pbmc8k_fastqs.tar) we downloaded right at the start of the post. This dataset was generated from Peripheral blood mononuclear cells (PBMCs), which are primary cells with relatively small amounts of RNA (~1pg RNA/cell), from a healthy donor and sequenced on an Illumina HiSeq 4000. The first read pair is 26 bp and contains the cell barcode and the UMI sequence (16bp Chromium barcode and 10bp UMI), the second read pair is 98 bp and contains the cDNA sequence, and there's the 8 bp I7 sample barcodes.

We'll extract the reads and check them out!

tar -xf pbmc8k_fastqs.tar

ls -1 fastqs/
pbmc8k_S1_L007_I1_001.fastq.gz
pbmc8k_S1_L007_R1_001.fastq.gz
pbmc8k_S1_L007_R2_001.fastq.gz
pbmc8k_S1_L008_I1_001.fastq.gz
pbmc8k_S1_L008_R1_001.fastq.gz
pbmc8k_S1_L008_R2_001.fastq.gz

# read 1 contains the cell barcode and UMI tag
# note the sample index in the header
zcat fastqs/pbmc8k_S1_L007_R1_001.fastq.gz | head -8
@ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 1:N:0:GTAATTGC
GGGCACTAGCTGATAAGGGCCCAACG
+
A-AFFJA-AAJ<FF-F<<F-7FJJJJ
@ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 1:N:0:GTAATTGC
CCGTGGACAGTGACAGATAACGCTTT
+
AAFF-AJ-A-AFJJ-FFFJJJFJJJF

# read 2 contains cDNA sequence
# note the sample index in the header
zcat fastqs/pbmc8k_S1_L007_R2_001.fastq.gz | head -8
@ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 2:N:0:GTAATTGC
GNTGTGGCAGAGCAGCGACCCGCGGCGGGGCGGCATCCCCAGCTGGTTCGGGCCGGGACGGGGCGGCCAGCAGGGACGCGCCCCAGGGGGGCAGCTGT
+
A#-<<F7<AJF-FJ<JAAJJFJJ<AF-7AJF77<FJJJFFFJJ<JA-7-777<-F7<<F--7AA7AAAFF-AF<A-AFFA7J7F--7)-)7--7A<J-
@ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 2:N:0:GTAATTGC
CNATGAGTGAAAGGGAGCCAGAAGACTGATTGGAGGGCCCTAGCTTGTGAGTGCGGCATCTCTTGGACTTTCCACCTGGTCATATACTCTGCAGCTGT
+
A#<-<FFAFFJJJ--FAFJJFJJ77FJ<FJFJAJ--<77AJ<<FAA-AFFJ--7FFAFAAF-A-777A7AFJJ7-7-77---<F-<FJJJJAFJJJAA

# the index reads contain the sample index
zcat fastqs/pbmc8k_S1_L007_I1_001.fastq.gz | head -8
@ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 1:N:0:GTAATTGC
GTAATTGC
+
AAAFFJFJ
@ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 1:N:0:GTAATTGC
GTAATTGC
+
AAAFFJ<F

We can now run cellranger count on our FASTQ files. We will use the argument --cells=10000, which is the expected number of recovered cells. The pipeline will create a new directory based on the --id input; if this folder already exists, cellranger will assume it is an existing pipestance and attempt to resume running it. The --sample input needs to set to the name prefixed to our FASTQ files; in this run, I've set --id and --sample as pbmc8k. Other input arguments are self-explanatory.

cellranger count --id=pbmc8k \
                 --transcriptome=refdata-cellranger-GRCh38-1.2.0 \
                 --fastqs=fastqs/ \
                 --sample=pbmc8k \
                 --localcores=32 \
                 --expect-cells=10000 \
                 --localmem=200

...
Waiting 6 seconds for UI to do final refresh.
Pipestance completed successfully!

Saving pipestance info to pbmc8k/pbmc8k.mri.tgz

That took roughly 9.5 hours and the results are in the folder pbmc8k.

tree --charset=ascii -L 2 pbmc8k
pbmc8k
|-- _cmdline
|-- _filelist
|-- _finalstate
|-- _invocation
|-- _jobmode
|-- _log
|-- _mrosource
|-- outs
|   |-- analysis
|   |-- cloupe.cloupe
|   |-- filtered_gene_bc_matrices
|   |-- filtered_gene_bc_matrices_h5.h5
|   |-- metrics_summary.csv
|   |-- molecule_info.h5
|   |-- possorted_genome_bam.bam
|   |-- possorted_genome_bam.bam.bai
|   |-- raw_gene_bc_matrices
|   |-- raw_gene_bc_matrices_h5.h5
|   `-- web_summary.html
|-- pbmc8k.mri.tgz
|-- _perf
|-- _perf._truncated_
|-- SC_RNA_COUNTER_CS
|   |-- CLOUPE_PREPROCESS
|   |-- fork0
|   `-- SC_RNA_COUNTER
|-- _sitecheck
|-- _tags
|-- _timestamp
|-- _uuid
|-- _vdrkill
`-- _versions

8 directories, 24 files

Below is a description of the output directories and files.

File Name Description
web_summary.html Run summary metrics and charts in HTML format
metrics_summary.csv Run summary metrics in CSV format
possorted_genome_bam.bam Reads aligned to the genome and transcriptome annotated with barcode information
possorted_genome_bam.bam.bai Index for possorted_genome_bam.bam
filtered_gene_bc_matrices Filtered gene-barcode matrices containing only cellular barcodes in MEX format
filtered_gene_bc_matrices_h5.h5 Filtered gene-barcode matrices containing only cellular barcodes in HDF5 format
raw_gene_bc_matrices Unfiltered gene-barcode matrices containing all barcodes in MEX format
raw_gene_bc_matrices_h5.h5 Unfiltered gene-barcode matrices containing all barcodes in HDF5 format
analysis Secondary analysis data including dimensionality reduction, cell clustering, and differential expression
molecule_info.h5 Molecule-level information used by cellranger aggr to aggregate samples into larger datasets.
cloupe.cloupe Loupe Cell Browser visualization and analysis file

I've written a post describing the BAM file generated by cellranger count. The raw_gene_bc_matrices and filtered_gene_bc_matrices contain the gene expression profiles for all cell barcodes and filtered cell barcodes, respectively.

These matrices are available for download and we can compare our run results with the results available online.

# compare raw matrix
wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_raw_gene_bc_matrices.tar.gz
tar -xzf pbmc8k_raw_gene_bc_matrices.tar.gz

tree --charset=ascii raw_gene_bc_matrices/
raw_gene_bc_matrices/
`-- GRCh38
    |-- barcodes.tsv
    |-- genes.tsv
    `-- matrix.mtx

1 directory, 3 files

# checksum for downloaded results
md5sum raw_gene_bc_matrices/GRCh38/*
f010059d7160aac2dad1805e1496bf71  raw_gene_bc_matrices/GRCh38/barcodes.tsv
49255c8490781e36052f33ac8fbf6c23  raw_gene_bc_matrices/GRCh38/genes.tsv
8a277e197570347905c0151bd9563405  raw_gene_bc_matrices/GRCh38/matrix.mtx

# checksum for our run
md5sum outs/raw_gene_bc_matrices/GRCh38/*
f010059d7160aac2dad1805e1496bf71  outs/raw_gene_bc_matrices/GRCh38/barcodes.tsv
49255c8490781e36052f33ac8fbf6c23  outs/raw_gene_bc_matrices/GRCh38/genes.tsv
8a277e197570347905c0151bd9563405  outs/raw_gene_bc_matrices/GRCh38/matrix.mtx

Our raw matrix files are exactly the same as the ones online.

wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz
tar -xzf pbmc8k_filtered_gene_bc_matrices.tar.gz

tree --charset=ascii filtered_gene_bc_matrices/
filtered_gene_bc_matrices/
`-- GRCh38
    |-- barcodes.tsv
    |-- genes.tsv
    `-- matrix.mtx

1 directory, 3 files

# checksum for downloaded results
md5sum filtered_gene_bc_matrices/GRCh38/*
1613a13e62858867c6af36206ec72c99  filtered_gene_bc_matrices/GRCh38/barcodes.tsv
49255c8490781e36052f33ac8fbf6c23  filtered_gene_bc_matrices/GRCh38/genes.tsv
27845c1e8de085782b25e13980351e96  filtered_gene_bc_matrices/GRCh38/matrix.mtx

# checksum for our run
md5sum outs/filtered_gene_bc_matrices/GRCh38/*
1613a13e62858867c6af36206ec72c99  outs/filtered_gene_bc_matrices/GRCh38/barcodes.tsv
49255c8490781e36052f33ac8fbf6c23  outs/filtered_gene_bc_matrices/GRCh38/genes.tsv
27845c1e8de085782b25e13980351e96  outs/filtered_gene_bc_matrices/GRCh38/matrix.mtx

Our filtered matrix files are exactly the same as well.

DropletUtils

The difference between the raw and the filtered gene matrices is that the raw matrices contain all cell barcodes and the filtered dataset only contains cell barcodes that Cell Ranger considers to be real, i.e. droplets that contain a cell. The total UMI of a cell barcode is used to rank the barcodes and this is plotted on a log scale (from highest to lowest) to determine the UMI threshold for filtering.

The plot above is generated by the cellranger count step; it can be view here. The assumption is that cell barcodes with high UMI counts contain real cells and those with low UMI counts don't. The drop-off point in the plot is used to determine the threshold.

An alternative method for determining whether droplets contain a cell or just ambient RNA (which some refer to as "soup") is implemented in the DropletUtils package. Briefly, the method takes all cell barcodes with a UMI count less than 100 (default) and generates a proportion value for each gene. Next, all cell barcodes with a UMI count higher than 100 are tested against this proportion vector; cell barcodes with a significant deviation from the ambient profile are considered droplets with real cells.

To get started, install the package.

install.packages("BiocManager")
BiocManager::install("DropletUtils")

The package uses the read10xCounts function to load 10x data and stores it as a SingleCellExperiment object. The argument is the folder containing the barcodes.tsv, genes.tsv, and matrix.mtx files.

library("DropletUtils")

sce <- read10xCounts("raw_gene_bc_matrices/GRCh38/")
sce

class: SingleCellExperiment 
dim: 33694 737280 
metadata(0):
assays(1): counts
rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ENSG00000268674
rowData names(2): ID Symbol
colnames: NULL
colData names(2): Sample Barcode
reducedDimNames(0):
spikeNames(0):

The barcodeRanks function is then used to rank the cell barcodes by its total UMI count; the function also calculates the drop-off points.

br.out <- barcodeRanks(counts(sce))

library("dplyr")
# I'll convert the list into a data frame
# and add the cell barcodes to the data frame
br.out.df <- as.data.frame(br.out)
br.out.df$barcode <- colData(sce)$Barcode
br.out.df %>%
  filter(rank <= 10) %>%
  arrange(rank)

   rank total fitted knee inflection            barcode
1     1 31785     NA 3091        605 GCAGTTAGTTCCATGA-1
2     2 26321     NA 3091        605 CCTTCCCTCATCTGTT-1
3     3 25925     NA 3091        605 AGATCTGGTGTGCGTC-1
4     4 24530     NA 3091        605 ACCCACTAGATCGATA-1
5     5 22883     NA 3091        605 GTTAAGCGTACCTACA-1
6     6 21744     NA 3091        605 GGCTCGAAGAGGTAGA-1
7     7 21446     NA 3091        605 GATCGCGTCCTGCTTG-1
8     8 21086     NA 3091        605 TGCGGGTGTACCAGTT-1
9     9 21005     NA 3091        605 AGGTCCGAGCTAACTC-1
10   10 20279     NA 3091        605 GTGTGCGCATGCCACG-1

We can produce a barcode rank vs. total UMI plot similar to the one in the web summary report.

library(ggplot2)
x_knee <- br.out.df %>% filter(total > br.out$knee) %>% arrange(total) %>% select(rank) %>% head(1)
x_inflection <- br.out.df %>% filter(total > br.out$inflection) %>% arrange(total) %>% select(rank) %>% head(1)
padding <- length(br.out$rank) / 10

ggplot(br.out.df, aes(x = rank, y = total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 14),
        title = element_text(size = 16)) +
  geom_hline(yintercept = br.out$knee, linetype = 2, colour = "dodgerblue") +
  geom_hline(yintercept = br.out$inflection, linetype = 2, colour = "forestgreen") +
  labs(x = "Rank", y = "Total", title = "Barcode Rank vs Total UMI") +
  annotate("text", label = paste0("Knee (", x_knee, ")"), x = x_knee$rank + padding, y = br.out$knee, size = 5) +
  annotate("text", label = paste0("Inflection (", x_inflection, ")"), x = x_inflection$rank + padding, y = br.out$inflection, size = 5)

If we filter at a total UMI count of 3091, we get 6,705 cell barcodes; at a total UMI of 605, we get 8,719 cell barcodes.

Now to test whether cell barcodes with >100 total UMIs have the same gene profile as "ambient" cell barcodes.

set.seed(100)
e.out <- emptyDrops(counts(sce))

# use FDR threshold of 0.01
is.cell <- e.out$FDR <= 0.01

# replace all NA's with FALSE
is.cell.no.na <- is.cell
is.cell.no.na[is.na(is.cell)] <- FALSE
sum(is.cell.no.na)
[1] 8482

ggplot(br.out.df[is.cell.no.na,], aes(x = rank, y = total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 14),
        title = element_text(size = 16)) +
  labs(x = "Rank", y = "Total", title = "Cell barcodes that differ from ambient RNA")

Cell barcodes that likely contain a real cell. Note that some low UMI cell barcodes are considered to contain cells.

We can compare this to what Cell Ranger considers real cell droplets by loading the filtered results; you need the readr and gplots packages.

my_filtered_bc <- readr::read_tsv("filtered_gene_bc_matrices/GRCh38/barcodes.tsv", col_names = "barcode")
gplots::venn(list(DropletUtils = br.out.df$barcode[is.cell.no.na], "Cell Ranger" = my_filtered_bc$barcode))

There's a proportion of cell barcodes with low UMIs that differ enough from the ambient profile that DropletUtils considers to contain a cell.

Creating your own reference

The human reference (GRCh38) dataset provided by 10x Genomics contains the FASTA sequence of hg38, a GTF file of genes, a pickle file (not sure what that is), and index files for STAR.

tree --charset=ascii refdata-cellranger-GRCh38-1.2.0
refdata-cellranger-GRCh38-1.2.0
|-- fasta
|   `-- genome.fa
|-- genes
|   `-- genes.gtf
|-- pickle
|   `-- genes.pickle
|-- README.BEFORE.MODIFYING
|-- reference.json
|-- star
|   |-- chrLength.txt
|   |-- chrNameLength.txt
|   |-- chrName.txt
|   |-- chrStart.txt
|   |-- exonGeTrInfo.tab
|   |-- exonInfo.tab
|   |-- geneInfo.tab
|   |-- Genome
|   |-- genomeParameters.txt
|   |-- SA
|   |-- SAindex
|   |-- sjdbInfo.txt
|   |-- sjdbList.fromGTF.out.tab
|   |-- sjdbList.out.tab
|   `-- transcriptInfo.tab
`-- version

4 directories, 21 files

However, if you are not working with human or mouse data, you will need to create your own reference files. Or you may be working with nuclei data, which contains a nascent mRNAs that are unspliced and contain introns; cellranger count will only count reads mapping to exons. Cell Ranger provides two utilities for creating your own references: cellranger mkgtf and cellranger mkref. You require a reference genome sequence (FASTA) and gene annotations (GTF). The guide is straightforward enough to follow though you will need to decide which biotypes to include in your GTF file.

Summary

Cell Ranger is the software provided by 10x Genomics to process Chromium single cell 3' RNA-seq data. There are four pipelines included in the software package but this post only described the first two. One issue with droplet based scRNA-seq methods is whether a droplet contains a cell (or even multiple cells, which should be another post) or just ambient RNA. DropletUtils is a package that can be used to determine whether droplets contained a cell.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
18 comments Add yours
  1. Thank you for the wonderful tutorial. I wonder what “S1_L001_I1_001” is the fastq file represents? Thanks

  2. Hi,
    Love your blog, it is very helpful. I was wondering where to find the information about what R1, R2 and I1 files are after making FASTQs.
    Thank you,
    Dan

  3. Hey can you tell what will be the ‘–sample’ if the fastqs are not generated by cellranger and do not follow the naming convention of cellranger?

  4. Hi Dave, this is very helpful thank you. Can you give me an idea of how long this whole analysis will take? I am just asking because I need to reserve some time on a shared computer to run this pipeline and wanted to know how much time I need to put aside. Thanks!

    1. Hi Esther,

      the count step takes the longest because of read mapping. Factors that affect the amount of computing time you need to reserve:

      * How many reads are you trying to process? And what reference genome are you mapping to?
      * How many CPUs (cores) do you have?
      * Do you have access to “scratch” space, which provides faster data access?

      I can’t estimate how long but it could take up to 24 hours; try it on your own dataset and see how long it takes.

      Cheers,
      Dave

      1. 10x Genomics have graphs of how long steps took based on how much memory was available:
        support.10xgenomics.com/single-cell-gene-expression/software/overview/system-requirements

  5. Any idea how to make it run flawlessly using slurum?

    I tried one sample (deeply sequenced) with 20 cores and 690GB RAM ,but its shutting down after 1.5 h with following error

    2020-02-05 14:27:06 [jobmngr] Highest memory usage observed: {
    “rss”: 14946304,
    “shared”: 3981312,
    “vmem”: 339881984,
    “text”: 4599808,
    “stack”: 319549440,
    “proc_count”: 51
    }

  6. Thank you for the tutorial, its very helpful.
    I have a question about cellranger aggr and downsample normalization between batches.
    My batches vary significantly in number of reads.
    Is it considered standard to use the normalization=mapped flag when using aggr?
    If this downsampling removes too many reads is using batch correction + Seurat normalization
    sufficient in most cases?

    1. Hi Michael,

      I have not used the down-sample normalisation functionality of Cell Ranger (I primarily used Cell Ranger to prepare the raw expression matrix) but have used Seurat for normalisation and integration to deal with batch effects across different experiments. Take a look at https://satijalab.org/seurat/v3.2/integration.html and pick an approach that works with your dataset.

      Cheers,
      Dave

  7. Hi, I want to ask you one question is that if i only have bam file (cellranger count make),how can i get the orignal fastq ( the author didn’t provide fastq file)

  8. Hi,

    I am new at this field. Is there any chance to run cell ranger pipeline on a pc with 16 gb ram?

  9. Hi Dave,

    I have a question about where to install bcl2fastq. I followed 10x instructions (https://kb.10xgenomics.com/hc/en-us/articles/360001618231-How-to-troubleshoot-installing-bcl2fastq-?src=distributor&lss=website&cnm=disti-web-website-page-ra_g-apac&cid=7011P000000oPEK) to download it but I don’t think I’m placing the executable in the correct folder. When I try to run analysis, I get an error saying bcl2fastq cannot be located. Please help as I am new to this.

    Thanks!

    Yuliya

    1. Hi Yuliya,

      you can place bcl2fastq anywhere but you need to add that location to your PATH. Do you know what shell you are using? If you type “echo $SHELL” in the terminal it should tell you what shell you are using. If you are using Bash and if the location of bcl2fastq is in /home/yuliya/bin/ (for example), you can add that path to your PATH by running:

      export PATH=/home/yuliya/bin:$PATH
      

      Then when you type “which bcl2fastq” it should show /home/yuliya/bin/bcl2fastq. To avoid typing that each time, you can add that export line to your .bashrc file.

      Check out this article too if you are new to all this: https://www.howtogeek.com/658904/how-to-add-a-directory-to-your-path-in-linux/.

      Hope that helps,
      Dave

  10. Hi Dave,

    this tutorial is really helpful Thank you for that. I am wondering which Cell Ranger version you used? I am working with dual index libraries and I used cellranger count transcriptome version 6.0.1. I am wondering does Cell Ranger uses both index files (I1 and I2 ( sample1_S1_L001_I1_001.fastq.gz and sample1_S1_L001_I2_001.fastq.gz)) to calculate such quality metrics as “Q30 Bases in Sample Index”? Do you know how Cell Ranger handles such things? I mean will it know when there are 2 index files that it is dual index? Do you know what I mean? I also read that you do not need I1 and I2 FASTQ files for running Cell Ranger if the barcode info is within the R1 file like by using the 3’v2+ version chemistry. But how will it calculate the Q30 for the sample index then? Or is this optional?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.