Cell Ranger is a set of analysis pipelines that process Chromium single cell 3' RNA-seq data. The pipelines process raw sequencing output, performs read alignment, generate gene-cell matrices, and can perform downstream analyses such as clustering and gene expression analysis. Cell Ranger includes four pipelines:
- cellranger mkfastq
- cellranger count
- cellranger aggr
- cellranger reanalyze
You can download Cell Ranger from their software download page. Conveniently, Cell Ranger is provided as a single self-contained file that bundles all its own software dependencies. You can view the source at their GitHub repository.
This post was started almost a year ago and it's still missing the cellranger aggr and cellranger reanalyze steps. I've decided to post it anyway, since I haven't had a post for two months.
Downloading data
First, start downloading the FASTQ files (73.61 GB) that we will use later in the post; they are quite large and depending on your Internet speed, may take up to several hours.
wget -c -N http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_fastqs.tar
Next, visit the download page to generate your own download link for the Cell Ranger tarball.
wget -O cellranger-2.2.0.tar.gz "http://cf.10xgenomics.com/releases/cell-exp/cellranger-2.2.0.tar.gz?Expires=1533651784&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cDovL2NmLjEweGdlbm9taWNzLmNvbS9yZWxlYXNlcy9jZWxsLWV4cC9jZWxscmFuZ2VyLTIuMi4wLnRhci5neiIsIkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTUzMzY1MTc4NH19fV19&Signature=L5k81QJRjubjoBwUeEcLTDT3uuuixkX4ZEJsNqO2ZLfe~cM2ygqHAJBHQ00xJxAdNPOyY8vu0RBaApGZM1H0-OqdqWg6rK4gdO6EF8lJu~oWv97X7zw6EfhVwISUFyoK22WFYR-MAKZHwxiIn8C3vORnZcxxUuN~sq0guVZ5ievI5iFYrHRhumQQhkQr~2fkbQDp3Bd5wtbyflfwMnE7fpdgy~jTSIbMLbHUICnM3-Gc~flroyEcoyugenrcSHNAcO~F0l~gmOewZnqqbEshxHBU4-faJ1iBixhJBlGHVHagqqpq6xyktVdAEIbX6uM~zawmHgi3axDh4HXffwMNiw__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"
Download the hg38 reference tarball.
wget http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-GRCh38-1.2.0.tar.gz
Download the BCL tarball and sample sheet.
wget http://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz wget http://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-simple-1.2.0.csv
Test run
Cell Ranger can perform a test run to verify the installation. We'll extract the tarball and perform the test run. Follow the official installation instructions for setting up Cell Ranger.
tar -xzf cellranger-2.2.0.tar.gz # test run cellranger testrun --id=testrun \ --localcores=32 \ --localmem=200 # if everything worked you should see the following Pipestance completed successfully! Saving pipestance info to testrun/testrun.mri.tgz
Note that Cell Ranger runs locally and will use 90% of available memory and cores by default. To restrict resource usage, use the command line arguments --localmem and --localcores, as above.
--localcores=NUM Set max cores the pipeline may request at one time. Only applies when --jobmode=local. --localmem=NUM Set max GB the pipeline may request at one time. Only applies when --jobmode=local.
Each core that Cell Ranger uses will spawn 64 user processes, so you may run into problems if your system has a limit on the max user processes. You can use ulimit -u to find out the limit.
ulimit -u 4096
If you have a limit of 4096, set --localcores to 64 or less since 64 * 64 = 4096.
mkfastq
Illumina sequencing instruments generate per-cycle raw base call (BCL) files as primary sequencing output. The cellranger mkfastq pipeline performs the demultiplexing and conversion step for BCL files for each flowcell directory.
This step requires Illumina's bcl2fastq tool, which needs to be set up prior to running mkfastq. I have compiled v2.20.0.422 in my bin directory, which has been added to my PATH.
We have already downloaded some sample BCL files in the download section; let's check it out.
tar -xzf cellranger-tiny-bcl-1.2.0.tar.gz tree --charset=ascii -L 3 cellranger-tiny-bcl-1.2.0 cellranger-tiny-bcl-1.2.0 |-- Config | |-- h35kcbcxy_Oct-20-16\ 15-51-48_Effective.cfg | |-- HiSeqControlSoftware.Options.cfg | |-- RTAStart.bat | `-- Variability_HiSeq_C.bin |-- Data | `-- Intensities | |-- BaseCalls | |-- config.xml | |-- L001 | |-- Offsets | `-- RTAConfiguration.xml |-- InterOp | |-- ControlMetricsOut.bin | |-- CorrectedIntMetricsOut.bin | |-- ErrorMetricsOut.bin | |-- ExtractionMetricsOut.bin | |-- ImageMetricsOut.bin | |-- QMetricsOut.bin | `-- TileMetricsOut.bin |-- RTAComplete.txt |-- RunInfo.xml `-- runParameters.xml 7 directories, 16 files
This is the typical directory structure generated by Illumina systems; the image below is for MiniSeq or NextSeq systems but other systems are quite similar.
Figure from the bcl2fastq2 Conversion Software v2.19 User Guide.
A sample sheet directs the mkfastq step on how to assign reads to samples and samples to projects. Let's check out the example sample sheet we downloaded.
cat cellranger-tiny-bcl-simple-1.2.0.csv Lane,Sample,Index 1,test_sample,SI-P03-C9
This is the simplest case where we have one lane of sequencing with one sample (test_sample) that has one sample index. Multiplexed sequencing allows multiple samples on one lane by introducing index sequences during sample preparation.
Let's run the mkfastq step; adjust the location of the directory and files according, since you may have put them in different locations.
cellranger mkfastq --id=tiny-bcl \ --localcores=32 \ --localmem=200 \ --run=cellranger-tiny-bcl-1.2.0 \ --csv=cellranger-tiny-bcl-simple-1.2.0.csv # not all output is shown ... Outputs: - Run QC metrics: null - FASTQ output folder: /home/dtang/fd/test/tiny-bcl/outs/fastq_path - Interop output folder: /home/dtang/fd/test/tiny-bcl/outs/interop_path - Input samplesheet: /home/dtang/fd/test/tiny-bcl/outs/input_samplesheet.csv Waiting 6 seconds for UI to do final refresh. Pipestance completed successfully! Saving pipestance info to tiny-bcl/tiny-bcl.mri.tgz
The tiny-bcl folder contains several directories and files.
tree --charset=ascii -L 2 tiny-bcl/ tiny-bcl/ |-- _cmdline |-- _filelist |-- _finalstate |-- _invocation |-- _jobmode |-- _log |-- MAKE_FASTQS_CS | |-- fork0 | `-- MAKE_FASTQS |-- _mrosource |-- outs | |-- fastq_path | |-- input_samplesheet.csv | `-- interop_path |-- _perf |-- _sitecheck |-- _tags |-- _timestamp |-- tiny-bcl.mri.tgz |-- _uuid |-- _vdrkill `-- _versions 6 directories, 16 files
The FASTQ output can be found in the outs/fastq_path/ directory (default setting).
tree --charset=ascii -L 3 tiny-bcl/outs/fastq_path/ tiny-bcl/outs/fastq_path/ |-- H35KCBCXY | `-- test_sample | |-- test_sample_S1_L001_I1_001.fastq.gz | |-- test_sample_S1_L001_R1_001.fastq.gz | `-- test_sample_S1_L001_R2_001.fastq.gz |-- Reports | `-- html | `-- H35KCBCXY |-- Stats | |-- AdapterTrimming.txt | |-- ConversionStats.xml | |-- DemultiplexingStats.xml | |-- DemuxSummaryF1L1.txt | |-- FastqSummaryF1L1.txt | `-- Stats.json |-- Undetermined_S0_L001_I1_001.fastq.gz |-- Undetermined_S0_L001_R1_001.fastq.gz `-- Undetermined_S0_L001_R2_001.fastq.gz 6 directories, 12 files
The FASTQ files are named according to the sample column of the sample sheet. If a sample ID was not specified, the flow cell ID is used instead (not shown here). In addition to the FASTQ files, bcl2fastq generates various summary files. If --stats-dir was not specified, summary and statistic files will be stored in a Stats folder by default.
count
cellranger count quantifies single-cell gene expression. For this step, we'll use the PBMC FASTQ files (pbmc8k_fastqs.tar) we downloaded right at the start of the post. This dataset was generated from Peripheral blood mononuclear cells (PBMCs), which are primary cells with relatively small amounts of RNA (~1pg RNA/cell), from a healthy donor and sequenced on an Illumina HiSeq 4000. The first read pair is 26 bp and contains the cell barcode and the UMI sequence (16bp Chromium barcode and 10bp UMI), the second read pair is 98 bp and contains the cDNA sequence, and there's the 8 bp I7 sample barcodes.
We'll extract the reads and check them out!
tar -xf pbmc8k_fastqs.tar ls -1 fastqs/ pbmc8k_S1_L007_I1_001.fastq.gz pbmc8k_S1_L007_R1_001.fastq.gz pbmc8k_S1_L007_R2_001.fastq.gz pbmc8k_S1_L008_I1_001.fastq.gz pbmc8k_S1_L008_R1_001.fastq.gz pbmc8k_S1_L008_R2_001.fastq.gz # read 1 contains the cell barcode and UMI tag # note the sample index in the header zcat fastqs/pbmc8k_S1_L007_R1_001.fastq.gz | head -8 @ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 1:N:0:GTAATTGC GGGCACTAGCTGATAAGGGCCCAACG + A-AFFJA-AAJ<FF-F<<F-7FJJJJ @ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 1:N:0:GTAATTGC CCGTGGACAGTGACAGATAACGCTTT + AAFF-AJ-A-AFJJ-FFFJJJFJJJF # read 2 contains cDNA sequence # note the sample index in the header zcat fastqs/pbmc8k_S1_L007_R2_001.fastq.gz | head -8 @ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 2:N:0:GTAATTGC GNTGTGGCAGAGCAGCGACCCGCGGCGGGGCGGCATCCCCAGCTGGTTCGGGCCGGGACGGGGCGGCCAGCAGGGACGCGCCCCAGGGGGGCAGCTGT + A#-<<F7<AJF-FJ<JAAJJFJJ<AF-7AJF77<FJJJFFFJJ<JA-7-777<-F7<<F--7AA7AAAFF-AF<A-AFFA7J7F--7)-)7--7A<J- @ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 2:N:0:GTAATTGC CNATGAGTGAAAGGGAGCCAGAAGACTGATTGGAGGGCCCTAGCTTGTGAGTGCGGCATCTCTTGGACTTTCCACCTGGTCATATACTCTGCAGCTGT + A#<-<FFAFFJJJ--FAFJJFJJ77FJ<FJFJAJ--<77AJ<<FAA-AFFJ--7FFAFAAF-A-777A7AFJJ7-7-77---<F-<FJJJJAFJJJAA # the index reads contain the sample index zcat fastqs/pbmc8k_S1_L007_I1_001.fastq.gz | head -8 @ST-K00126:314:HFYL2BBXX:7:1101:1631:1226 1:N:0:GTAATTGC GTAATTGC + AAAFFJFJ @ST-K00126:314:HFYL2BBXX:7:1101:1834:1226 1:N:0:GTAATTGC GTAATTGC + AAAFFJ<F
We can now run cellranger count on our FASTQ files. We will use the argument --cells=10000, which is the expected number of recovered cells. The pipeline will create a new directory based on the --id input; if this folder already exists, cellranger will assume it is an existing pipestance and attempt to resume running it. The --sample input needs to set to the name prefixed to our FASTQ files; in this run, I've set --id and --sample as pbmc8k. Other input arguments are self-explanatory.
cellranger count --id=pbmc8k \ --transcriptome=refdata-cellranger-GRCh38-1.2.0 \ --fastqs=fastqs/ \ --sample=pbmc8k \ --localcores=32 \ --expect-cells=10000 \ --localmem=200 ... Waiting 6 seconds for UI to do final refresh. Pipestance completed successfully! Saving pipestance info to pbmc8k/pbmc8k.mri.tgz
That took roughly 9.5 hours and the results are in the folder pbmc8k.
tree --charset=ascii -L 2 pbmc8k pbmc8k |-- _cmdline |-- _filelist |-- _finalstate |-- _invocation |-- _jobmode |-- _log |-- _mrosource |-- outs | |-- analysis | |-- cloupe.cloupe | |-- filtered_gene_bc_matrices | |-- filtered_gene_bc_matrices_h5.h5 | |-- metrics_summary.csv | |-- molecule_info.h5 | |-- possorted_genome_bam.bam | |-- possorted_genome_bam.bam.bai | |-- raw_gene_bc_matrices | |-- raw_gene_bc_matrices_h5.h5 | `-- web_summary.html |-- pbmc8k.mri.tgz |-- _perf |-- _perf._truncated_ |-- SC_RNA_COUNTER_CS | |-- CLOUPE_PREPROCESS | |-- fork0 | `-- SC_RNA_COUNTER |-- _sitecheck |-- _tags |-- _timestamp |-- _uuid |-- _vdrkill `-- _versions 8 directories, 24 files
Below is a description of the output directories and files.
File Name | Description |
---|---|
web_summary.html | Run summary metrics and charts in HTML format |
metrics_summary.csv | Run summary metrics in CSV format |
possorted_genome_bam.bam | Reads aligned to the genome and transcriptome annotated with barcode information |
possorted_genome_bam.bam.bai | Index for possorted_genome_bam.bam |
filtered_gene_bc_matrices | Filtered gene-barcode matrices containing only cellular barcodes in MEX format |
filtered_gene_bc_matrices_h5.h5 | Filtered gene-barcode matrices containing only cellular barcodes in HDF5 format |
raw_gene_bc_matrices | Unfiltered gene-barcode matrices containing all barcodes in MEX format |
raw_gene_bc_matrices_h5.h5 | Unfiltered gene-barcode matrices containing all barcodes in HDF5 format |
analysis | Secondary analysis data including dimensionality reduction, cell clustering, and differential expression |
molecule_info.h5 | Molecule-level information used by cellranger aggr to aggregate samples into larger datasets. |
cloupe.cloupe | Loupe Cell Browser visualization and analysis file |
I've written a post describing the BAM file generated by cellranger count. The raw_gene_bc_matrices and filtered_gene_bc_matrices contain the gene expression profiles for all cell barcodes and filtered cell barcodes, respectively.
These matrices are available for download and we can compare our run results with the results available online.
# compare raw matrix wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_raw_gene_bc_matrices.tar.gz tar -xzf pbmc8k_raw_gene_bc_matrices.tar.gz tree --charset=ascii raw_gene_bc_matrices/ raw_gene_bc_matrices/ `-- GRCh38 |-- barcodes.tsv |-- genes.tsv `-- matrix.mtx 1 directory, 3 files # checksum for downloaded results md5sum raw_gene_bc_matrices/GRCh38/* f010059d7160aac2dad1805e1496bf71 raw_gene_bc_matrices/GRCh38/barcodes.tsv 49255c8490781e36052f33ac8fbf6c23 raw_gene_bc_matrices/GRCh38/genes.tsv 8a277e197570347905c0151bd9563405 raw_gene_bc_matrices/GRCh38/matrix.mtx # checksum for our run md5sum outs/raw_gene_bc_matrices/GRCh38/* f010059d7160aac2dad1805e1496bf71 outs/raw_gene_bc_matrices/GRCh38/barcodes.tsv 49255c8490781e36052f33ac8fbf6c23 outs/raw_gene_bc_matrices/GRCh38/genes.tsv 8a277e197570347905c0151bd9563405 outs/raw_gene_bc_matrices/GRCh38/matrix.mtx
Our raw matrix files are exactly the same as the ones online.
wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz tar -xzf pbmc8k_filtered_gene_bc_matrices.tar.gz tree --charset=ascii filtered_gene_bc_matrices/ filtered_gene_bc_matrices/ `-- GRCh38 |-- barcodes.tsv |-- genes.tsv `-- matrix.mtx 1 directory, 3 files # checksum for downloaded results md5sum filtered_gene_bc_matrices/GRCh38/* 1613a13e62858867c6af36206ec72c99 filtered_gene_bc_matrices/GRCh38/barcodes.tsv 49255c8490781e36052f33ac8fbf6c23 filtered_gene_bc_matrices/GRCh38/genes.tsv 27845c1e8de085782b25e13980351e96 filtered_gene_bc_matrices/GRCh38/matrix.mtx # checksum for our run md5sum outs/filtered_gene_bc_matrices/GRCh38/* 1613a13e62858867c6af36206ec72c99 outs/filtered_gene_bc_matrices/GRCh38/barcodes.tsv 49255c8490781e36052f33ac8fbf6c23 outs/filtered_gene_bc_matrices/GRCh38/genes.tsv 27845c1e8de085782b25e13980351e96 outs/filtered_gene_bc_matrices/GRCh38/matrix.mtx
Our filtered matrix files are exactly the same as well.
DropletUtils
The difference between the raw and the filtered gene matrices is that the raw matrices contain all cell barcodes and the filtered dataset only contains cell barcodes that Cell Ranger considers to be real, i.e. droplets that contain a cell. The total UMI of a cell barcode is used to rank the barcodes and this is plotted on a log scale (from highest to lowest) to determine the UMI threshold for filtering.
The plot above is generated by the cellranger count step; it can be view here. The assumption is that cell barcodes with high UMI counts contain real cells and those with low UMI counts don't. The drop-off point in the plot is used to determine the threshold.
An alternative method for determining whether droplets contain a cell or just ambient RNA (which some refer to as "soup") is implemented in the DropletUtils package. Briefly, the method takes all cell barcodes with a UMI count less than 100 (default) and generates a proportion value for each gene. Next, all cell barcodes with a UMI count higher than 100 are tested against this proportion vector; cell barcodes with a significant deviation from the ambient profile are considered droplets with real cells.
To get started, install the package.
install.packages("BiocManager") BiocManager::install("DropletUtils")
The package uses the read10xCounts function to load 10x data and stores it as a SingleCellExperiment object. The argument is the folder containing the barcodes.tsv, genes.tsv, and matrix.mtx files.
library("DropletUtils") sce <- read10xCounts("raw_gene_bc_matrices/GRCh38/") sce class: SingleCellExperiment dim: 33694 737280 metadata(0): assays(1): counts rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ENSG00000268674 rowData names(2): ID Symbol colnames: NULL colData names(2): Sample Barcode reducedDimNames(0): spikeNames(0):
The barcodeRanks function is then used to rank the cell barcodes by its total UMI count; the function also calculates the drop-off points.
br.out <- barcodeRanks(counts(sce)) library("dplyr") # I'll convert the list into a data frame # and add the cell barcodes to the data frame br.out.df <- as.data.frame(br.out) br.out.df$barcode <- colData(sce)$Barcode br.out.df %>% filter(rank <= 10) %>% arrange(rank) rank total fitted knee inflection barcode 1 1 31785 NA 3091 605 GCAGTTAGTTCCATGA-1 2 2 26321 NA 3091 605 CCTTCCCTCATCTGTT-1 3 3 25925 NA 3091 605 AGATCTGGTGTGCGTC-1 4 4 24530 NA 3091 605 ACCCACTAGATCGATA-1 5 5 22883 NA 3091 605 GTTAAGCGTACCTACA-1 6 6 21744 NA 3091 605 GGCTCGAAGAGGTAGA-1 7 7 21446 NA 3091 605 GATCGCGTCCTGCTTG-1 8 8 21086 NA 3091 605 TGCGGGTGTACCAGTT-1 9 9 21005 NA 3091 605 AGGTCCGAGCTAACTC-1 10 10 20279 NA 3091 605 GTGTGCGCATGCCACG-1
We can produce a barcode rank vs. total UMI plot similar to the one in the web summary report.
library(ggplot2) x_knee <- br.out.df %>% filter(total > br.out$knee) %>% arrange(total) %>% select(rank) %>% head(1) x_inflection <- br.out.df %>% filter(total > br.out$inflection) %>% arrange(total) %>% select(rank) %>% head(1) padding <- length(br.out$rank) / 10 ggplot(br.out.df, aes(x = rank, y = total)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw() + theme(axis.text = element_text(size = 10), axis.title = element_text(size = 14), title = element_text(size = 16)) + geom_hline(yintercept = br.out$knee, linetype = 2, colour = "dodgerblue") + geom_hline(yintercept = br.out$inflection, linetype = 2, colour = "forestgreen") + labs(x = "Rank", y = "Total", title = "Barcode Rank vs Total UMI") + annotate("text", label = paste0("Knee (", x_knee, ")"), x = x_knee$rank + padding, y = br.out$knee, size = 5) + annotate("text", label = paste0("Inflection (", x_inflection, ")"), x = x_inflection$rank + padding, y = br.out$inflection, size = 5)
If we filter at a total UMI count of 3091, we get 6,705 cell barcodes; at a total UMI of 605, we get 8,719 cell barcodes.
Now to test whether cell barcodes with >100 total UMIs have the same gene profile as "ambient" cell barcodes.
set.seed(100) e.out <- emptyDrops(counts(sce)) # use FDR threshold of 0.01 is.cell <- e.out$FDR <= 0.01 # replace all NA's with FALSE is.cell.no.na <- is.cell is.cell.no.na[is.na(is.cell)] <- FALSE sum(is.cell.no.na) [1] 8482 ggplot(br.out.df[is.cell.no.na,], aes(x = rank, y = total)) + geom_point() + scale_x_log10() + scale_y_log10() + theme_bw() + theme(axis.text = element_text(size = 10), axis.title = element_text(size = 14), title = element_text(size = 16)) + labs(x = "Rank", y = "Total", title = "Cell barcodes that differ from ambient RNA")
Cell barcodes that likely contain a real cell. Note that some low UMI cell barcodes are considered to contain cells.
We can compare this to what Cell Ranger considers real cell droplets by loading the filtered results; you need the readr and gplots packages.
my_filtered_bc <- readr::read_tsv("filtered_gene_bc_matrices/GRCh38/barcodes.tsv", col_names = "barcode") gplots::venn(list(DropletUtils = br.out.df$barcode[is.cell.no.na], "Cell Ranger" = my_filtered_bc$barcode))
There's a proportion of cell barcodes with low UMIs that differ enough from the ambient profile that DropletUtils considers to contain a cell.
Creating your own reference
The human reference (GRCh38) dataset provided by 10x Genomics contains the FASTA sequence of hg38, a GTF file of genes, a pickle file (not sure what that is), and index files for STAR.
tree --charset=ascii refdata-cellranger-GRCh38-1.2.0 refdata-cellranger-GRCh38-1.2.0 |-- fasta | `-- genome.fa |-- genes | `-- genes.gtf |-- pickle | `-- genes.pickle |-- README.BEFORE.MODIFYING |-- reference.json |-- star | |-- chrLength.txt | |-- chrNameLength.txt | |-- chrName.txt | |-- chrStart.txt | |-- exonGeTrInfo.tab | |-- exonInfo.tab | |-- geneInfo.tab | |-- Genome | |-- genomeParameters.txt | |-- SA | |-- SAindex | |-- sjdbInfo.txt | |-- sjdbList.fromGTF.out.tab | |-- sjdbList.out.tab | `-- transcriptInfo.tab `-- version 4 directories, 21 files
However, if you are not working with human or mouse data, you will need to create your own reference files. Or you may be working with nuclei data, which contains a nascent mRNAs that are unspliced and contain introns; cellranger count will only count reads mapping to exons. Cell Ranger provides two utilities for creating your own references: cellranger mkgtf and cellranger mkref. You require a reference genome sequence (FASTA) and gene annotations (GTF). The guide is straightforward enough to follow though you will need to decide which biotypes to include in your GTF file.
Summary
Cell Ranger is the software provided by 10x Genomics to process Chromium single cell 3' RNA-seq data. There are four pipelines included in the software package but this post only described the first two. One issue with droplet based scRNA-seq methods is whether a droplet contains a cell (or even multiple cells, which should be another post) or just ambient RNA. DropletUtils is a package that can be used to determine whether droplets contained a cell.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Thank you for the wonderful tutorial. I wonder what “S1_L001_I1_001” is the fastq file represents? Thanks
Mapping file. Check https://groups.google.com/forum/#!topic/qiime-forum/sNYVTSVImqo
Hi,
Love your blog, it is very helpful. I was wondering where to find the information about what R1, R2 and I1 files are after making FASTQs.
Thank you,
Dan
Hey can you tell what will be the ‘–sample’ if the fastqs are not generated by cellranger and do not follow the naming convention of cellranger?
Hi Dave, this is very helpful thank you. Can you give me an idea of how long this whole analysis will take? I am just asking because I need to reserve some time on a shared computer to run this pipeline and wanted to know how much time I need to put aside. Thanks!
Hi Esther,
the count step takes the longest because of read mapping. Factors that affect the amount of computing time you need to reserve:
* How many reads are you trying to process? And what reference genome are you mapping to?
* How many CPUs (cores) do you have?
* Do you have access to “scratch” space, which provides faster data access?
I can’t estimate how long but it could take up to 24 hours; try it on your own dataset and see how long it takes.
Cheers,
Dave
10x Genomics have graphs of how long steps took based on how much memory was available:
support.10xgenomics.com/single-cell-gene-expression/software/overview/system-requirements
Any idea how to make it run flawlessly using slurum?
I tried one sample (deeply sequenced) with 20 cores and 690GB RAM ,but its shutting down after 1.5 h with following error
2020-02-05 14:27:06 [jobmngr] Highest memory usage observed: {
“rss”: 14946304,
“shared”: 3981312,
“vmem”: 339881984,
“text”: 4599808,
“stack”: 319549440,
“proc_count”: 51
}
Thank you for the tutorial, its very helpful.
I have a question about cellranger aggr and downsample normalization between batches.
My batches vary significantly in number of reads.
Is it considered standard to use the normalization=mapped flag when using aggr?
If this downsampling removes too many reads is using batch correction + Seurat normalization
sufficient in most cases?
Hi Michael,
I have not used the down-sample normalisation functionality of Cell Ranger (I primarily used Cell Ranger to prepare the raw expression matrix) but have used Seurat for normalisation and integration to deal with batch effects across different experiments. Take a look at https://satijalab.org/seurat/v3.2/integration.html and pick an approach that works with your dataset.
Cheers,
Dave
Hi, I want to ask you one question is that if i only have bam file (cellranger count make),how can i get the orignal fastq ( the author didn’t provide fastq file)
Hi Zhu,
I think it is easiest just to ask the author/s for the FASTQ files. Otherwise have you looked at https://support.10xgenomics.com/docs/bamtofastq?
Cheers,
Dave
Hi,
I am new at this field. Is there any chance to run cell ranger pipeline on a pc with 16 gb ram?
Hi there,
you will need much more computational resources to run Cell Ranger; see the official system requirements at https://support.10xgenomics.com/single-cell-gene-expression/software/overview/system-requirements.
Cheers,
Dave
Hi Dave,
I have a question about where to install bcl2fastq. I followed 10x instructions (https://kb.10xgenomics.com/hc/en-us/articles/360001618231-How-to-troubleshoot-installing-bcl2fastq-?src=distributor&lss=website&cnm=disti-web-website-page-ra_g-apac&cid=7011P000000oPEK) to download it but I don’t think I’m placing the executable in the correct folder. When I try to run analysis, I get an error saying bcl2fastq cannot be located. Please help as I am new to this.
Thanks!
Yuliya
Hi Yuliya,
you can place bcl2fastq anywhere but you need to add that location to your PATH. Do you know what shell you are using? If you type “echo $SHELL” in the terminal it should tell you what shell you are using. If you are using Bash and if the location of bcl2fastq is in /home/yuliya/bin/ (for example), you can add that path to your PATH by running:
Then when you type “which bcl2fastq” it should show /home/yuliya/bin/bcl2fastq. To avoid typing that each time, you can add that export line to your .bashrc file.
Check out this article too if you are new to all this: https://www.howtogeek.com/658904/how-to-add-a-directory-to-your-path-in-linux/.
Hope that helps,
Dave
Hi Dave,
this tutorial is really helpful Thank you for that. I am wondering which Cell Ranger version you used? I am working with dual index libraries and I used cellranger count transcriptome version 6.0.1. I am wondering does Cell Ranger uses both index files (I1 and I2 ( sample1_S1_L001_I1_001.fastq.gz and sample1_S1_L001_I2_001.fastq.gz)) to calculate such quality metrics as “Q30 Bases in Sample Index”? Do you know how Cell Ranger handles such things? I mean will it know when there are 2 index files that it is dual index? Do you know what I mean? I also read that you do not need I1 and I2 FASTQ files for running Cell Ranger if the barcode info is within the R1 file like by using the 3’v2+ version chemistry. But how will it calculate the Q30 for the sample index then? Or is this optional?
Hi Marie,
this is quite an old post, which used Cell Ranger version 3 and I have not worked with any later versions. Are you able to find your answer at https://support.10xgenomics.com/? It also looks like you can email them at support@10xgenomics.com, if you haven’t already.
Hope that helps,
Dave