The Chromium Single Cell 3' Solution is a commercial platform developed by 10x Genomics for preparing single cell cDNA libraries for performing single cell RNA-seq. In addition, 10x Genomics have developed an entire software suite called Cell Ranger that can process the raw BCL files produced by an Illumina sequencer and output a final gene-barcode expression matrix.
However, Cell Ranger produces various files at the end of the pipeline, including summary, BAM, and HDF files. This post is specifically on the annotations present in the BAM file.
10x Genomics have a page explaining the BAM output file. However, in this post I'll provide a bit more elaboration with examples. To get started, let's take a look at a single entry, the first line, of a BAM file generated by Cell Ranger on the 8k PBMCs sample.
# transpose is a Perl script I wrote; see https://davetang.org/muse/2014/09/09/transpose-tool/ # this line simply takes the first mapped read and prints out each section on a new line samtools view http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_possorted_genome_bam.bam | head -1 | transpose | nl [knet_seek] SEEK_END is not supported for HTTP. Offset is unchanged. 1 ST-K00126:314:HFYL2BBXX:7:2103:14996:4725 2 272 3 1 4 10001 5 1 6 3S95M 7 * 8 0 9 0 10 CCGTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC 11 <)-7-)7<JF7<FA7---<<A-FFFF-A<<AA----FFAJJJAAAJJAFJJJF<JJJA---FFA-AFFFFJ<FJF-FFFA7FFFJ7JFFAJ7FAFF<A 12 NH:i:3 13 HI:i:3 14 AS:i:91 15 nM:i:1 16 RE:A:I 17 BC:Z:TCGTCACG 18 QT:Z:AAFFFJJJ 19 CR:Z:TTAACTCGTAGAAGGA 20 CY:Z:AAFFFJJJJJJJJJJJ 21 CB:Z:TTAACTCGTAGAAGGA-1 22 UR:Z:GTCCGGCGAC 23 UY:Z:JJJJJJJJJJ 24 UB:Z:GTCCGGCGAC 25 RG:Z:pbmc8k:MissingLibrary:1:HFYL2BBXX:7
The first 11 lines are standard BAM fields and the lines afterwards are BAM tags. Some of the BAM tags are standard and some are specifically generated by the Cell Ranger software. The table below describes each field and tag in the BAM file.
Line | Tag | Description |
---|---|---|
1 | NA | Query name |
2 | NA | Flag |
3 | NA | Reference name |
4 | NA | Position |
5 | NA | Mapping quality |
6 | NA | Cigar string |
7 | NA | Reference name of mate |
8 | NA | Position of mate |
9 | NA | Template length |
10 | NA | Sequence |
11 | NA | Sequence quality |
12 | NH | Number of reported alignments for query |
13 | HI | Query hit index |
14 | AS | Alignment score |
15 | nM | Number of mismatches per pair |
16 | RE | Region type (E = exonic, N = intronic, I = intergenic) |
17 | BC | Sample index read |
18 | QT | Sample index read quality |
19 | CR | Cell barcode |
20 | CY | Cell barcode read quality |
21 | CB | Cell barcode that is error-corrected and confirmed against a list of known-good barcode sequences |
22 | UR | Unique Molecular Identifier (UMI) |
23 | UY | UMI read quality |
24 | UB | UMI that is error-corrected among other molecular barcodes with the same cellular barcode and gene alignment |
25 | RG | Read group |
To help put the BAM tags into perspective, here's how a final library fragment looks like.
The cell barcode (10XBC) is used to associate cDNAs to a specific cell and the UMIs are used to label specific cDNA molecules. After sequencing, Cell Ranger (using bcl2fastq) converts the BCL files into a set of FASTQ files. In each set are three files: R1, I1, and R2, which contains the sequence of the cell barcode + UMI, the sample index, and the cDNA sequence, respectively.
The first read (R1) contains the cell barcode sequence, which is 16 nt long and a UMI sequence that is 10 nt long. To illustrate this, I downloaded the raw FASTQ files for this dataset. I perform a grep on the R1 file to search for the specific read (the first line of the BAM file) and then output the match along with the next three lines after the match.
zcat pbmc8k_S1_L007_R1_001.fastq.gz | grep -A3 ST-K00126:314:HFYL2BBXX:7:2103:14996:4725 @ST-K00126:314:HFYL2BBXX:7:2103:14996:4725 1:N:0:TCGTCACG TTAACTCGTAGAAGGAGTCCGGCGAC + AAFFFJJJJJJJJJJJJJJJJJJJJJ
The first 16 nt of the sequence output is the cell barcode, TTAACTCGTAGAAGGA, and the last 10 nt is the UMI, GTCCGGCGAC; this information gets incorporated into the BAM file as the CR and UR tags, respectively. The base calling quality of these bases, AAFFFJJJJJJJJJJJ and JJJJJJJJJJ are stored as the CY and UY tags, respectively.
In addition to the CR and UR tags, there are the CB and UB tags, which contain the error-corrected cell barcode and UMI sequences. Errors in the cell barcode and UMI sequence can occur during the PCR amplification and sequencing steps, so the CB and UB tags contain the corrected sequences. However, I'm not sure how Cell Ranger performs the error correction; once I find out, I'll update this post.
One approach for mitigating errors in UMIs is implemented in UMI-tools. It had been observed that a lot of UMIs mapping to the same exact position differed by only 1 edit distance, i.e. one sequence difference, for example between AAAA and AACA. This highly suggests that these "unique" UMIs are actually a result of PCR and/or sequencing errors. The authors of UMI-tools used a network approach to join UMIs that map to the same location and only differed by 1 bp to estimate the true UMI count by discarding unique UMIs that were most likely generated by errors; you can read the paper here. Anyway back to the BAM tags.
Many (if not all) of Illumina's sequencers support multiplexing, i.e. indexing, which is what the 8 nt i7 sample index is used for. This is sequenced and stored in a separate file (I1).
zcat pbmc8k_S1_L007_I1_001.fastq.gz | grep -A3 ST-K00126:314:HFYL2BBXX:7:2103:14996:4725 @ST-K00126:314:HFYL2BBXX:7:2103:14996:4725 1:N:0:TCGTCACG TCGTCACG + AAFFFJJJ
The sample index sequence and its base calling qualities are stored in the BC and QT tags, respectively.
Finally, the second read contains a partial sequence of the cDNA (98 nt long). Since the sequencing of read 2 goes from the 3' to 5' end, there is an enrichment of sequences from the 3' end, hence the protocol is called Chromium Single Cell 3' Solution. (However, I believe that the enzymatic fragmentation of the double-stranded cDNA is random, which is why there is sequencing throughout a cDNA sequence.) The sequence and base calling quality are stored in the 10th and 11th fields, respectively, of the BAM file.
zcat pbmc8k_S1_L007_R2_001.fastq.gz | grep -A3 ST-K00126:314:HFYL2BBXX:7:2103:14996:4725 @ST-K00126:314:HFYL2BBXX:7:2103:14996:4725 2:N:0:TCGTCACG GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGCTACGG + A<FFAF7JAFFJ7JFFF7AFFF-FJF<JFFFFA-AFF---AJJJ<FJJJFAJJAAAJJJAFF----AA<<A-FFFF-A<<---7AF<7FJ<7)-7-)<
Final note: I probably didn't need to use "respectively" that often.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
I am looking for some script that could extract the cell-wise statistics (listed below) from the CellRanger bam file?
Total Number of reads
Total Number of mapped reads
Total Number of unmapped reads
Total Number of reads mapped to exonic regions
Total Number of reads mapped to intronic regions
Total Number of reads mapped to intragenic regions
Do you have any suggestions?
It has been a while since I used CellRanger, so what I suggest is based on outdated information.
You could do something like this:
samtools view -@ 8 40k_NSCLC_DTC_3p_HT_nextgem_Multiplex_count_unassigned_alignments.bam | head -10000 | perl -nle 'if(/RE:A:([ENI])/){ $r = $1 } else { $r = "U" }; $b = $1 if /CB:Z:([ACGT0-9-]+)/; print "$b\t$r"' | sort -k1,1 -k2,2 | uniq -c | sort -k1,1 | head
1 AAACCCACACCTGCGA-1 I
1 AAACCCAGTACAAGCG-1 I
1 AAACCCAGTGTTATCG-1 I
1 AAACCCAGTTGCATCA-1 I
1 AAACGAAAGCACCCAC-1 I
1 AAACGAAAGTCATCCA-1 I
1 AAACGAAAGTGTGTTC-1 I
1 AAACGAACACTGAGGA-1 I
1 AAACGAATCATCTCTA-1 I
1 AAACGCTAGAAACTAC-1 I
The line above streams the SAMtools output to Perl, which extracts the cell barcode and mapping region (U for unmapped), sorts by cell barcode and mapping region, tallies the counts, and then sorts by cell barcode again. This could take a long time to run depending on how big your BAM file is. If you want to try it, take out the head command in the pipeline.
If you have a list of whitelisted barcodes, you could write a script to speed this up (since we no longer have to sort) by tallying the regions for the barcodes as they occur line by line.