10x single cell BAM files

The Chromium Single Cell 3' Solution is a commercial platform developed by 10x Genomics for preparing single cell cDNA libraries for performing single cell RNA-seq. In addition, 10x Genomics have developed an entire software suite called Cell Ranger that can process the raw BCL files produced by an Illumina sequencer and output a final gene-barcode expression matrix.

However, Cell Ranger produces various files at the end of the pipeline, including summary, BAM, and HDF files. This post is specifically on the annotations present in the BAM file.

10x Genomics have a page explaining the BAM output file. However, in this post I'll provide a bit more elaboration with examples. To get started, let's take a look at a single entry, the first line, of a BAM file generated by Cell Ranger on the 8k PBMCs sample.

# transpose is a Perl script I wrote; see https://davetang.org/muse/2014/09/09/transpose-tool/
# this line simply takes the first mapped read and prints out each section on a new line
samtools view http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_possorted_genome_bam.bam | head -1 | transpose | nl
[knet_seek] SEEK_END is not supported for HTTP. Offset is unchanged.
     1  ST-K00126:314:HFYL2BBXX:7:2103:14996:4725
     2  272
     3  1
     4  10001
     5  1
     6  3S95M
     7  *
     8  0
     9  0
    10  CCGTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
    11  <)-7-)7<JF7<FA7---<<A-FFFF-A<<AA----FFAJJJAAAJJAFJJJF<JJJA---FFA-AFFFFJ<FJF-FFFA7FFFJ7JFFAJ7FAFF<A
    12  NH:i:3
    13  HI:i:3
    14  AS:i:91
    15  nM:i:1
    16  RE:A:I
    17  BC:Z:TCGTCACG
    18  QT:Z:AAFFFJJJ
    19  CR:Z:TTAACTCGTAGAAGGA
    20  CY:Z:AAFFFJJJJJJJJJJJ
    21  CB:Z:TTAACTCGTAGAAGGA-1
    22  UR:Z:GTCCGGCGAC
    23  UY:Z:JJJJJJJJJJ
    24  UB:Z:GTCCGGCGAC
    25  RG:Z:pbmc8k:MissingLibrary:1:HFYL2BBXX:7

The first 11 lines are standard BAM fields and the lines afterwards are BAM tags. Some of the BAM tags are standard and some are specifically generated by the Cell Ranger software. The table below describes each field and tag in the BAM file.

Line Tag Description
1 NA Query name
2 NA Flag
3 NA Reference name
4 NA Position
5 NA Mapping quality
6 NA Cigar string
7 NA Reference name of mate
8 NA Position of mate
9 NA Template length
10 NA Sequence
11 NA Sequence quality
12 NH Number of reported alignments for query
13 HI Query hit index
14 AS Alignment score
15 nM Number of mismatches per pair
16 RE Region type (E = exonic, N = intronic, I = intergenic)
17 BC Sample index read
18 QT Sample index read quality
19 CR Cell barcode
20 CY Cell barcode read quality
21 CB Cell barcode that is error-corrected and confirmed against a list of known-good barcode sequences
22 UR Unique Molecular Identifier (UMI)
23 UY UMI read quality
24 UB UMI that is error-corrected among other molecular barcodes with the same cellular barcode and gene alignment
25 RG Read group

To help put the BAM tags into perspective, here's how a final library fragment looks like.

The cell barcode (10XBC) is used to associate cDNAs to a specific cell and the UMIs are used to label specific cDNA molecules. After sequencing, Cell Ranger (using bcl2fastq) converts the BCL files into a set of FASTQ files. In each set are three files: R1, I1, and R2, which contains the sequence of the cell barcode + UMI, the sample index, and the cDNA sequence, respectively.

The first read (R1) contains the cell barcode sequence, which is 16 nt long and a UMI sequence that is 10 nt long. To illustrate this, I downloaded the raw FASTQ files for this dataset. I perform a grep on the R1 file to search for the specific read (the first line of the BAM file) and then output the match along with the next three lines after the match.

zcat pbmc8k_S1_L007_R1_001.fastq.gz | grep -A3 ST-K00126:314:HFYL2BBXX:7:2103:14996:4725
@ST-K00126:314:HFYL2BBXX:7:2103:14996:4725 1:N:0:TCGTCACG
TTAACTCGTAGAAGGAGTCCGGCGAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJ

The first 16 nt of the sequence output is the cell barcode, TTAACTCGTAGAAGGA, and the last 10 nt is the UMI, GTCCGGCGAC; this information gets incorporated into the BAM file as the CR and UR tags, respectively. The base calling quality of these bases, AAFFFJJJJJJJJJJJ and JJJJJJJJJJ are stored as the CY and UY tags, respectively.

In addition to the CR and UR tags, there are the CB and UB tags, which contain the error-corrected cell barcode and UMI sequences. Errors in the cell barcode and UMI sequence can occur during the PCR amplification and sequencing steps, so the CB and UB tags contain the corrected sequences. However, I'm not sure how Cell Ranger performs the error correction; once I find out, I'll update this post.

One approach for mitigating errors in UMIs is implemented in UMI-tools. It had been observed that a lot of UMIs mapping to the same exact position differed by only 1 edit distance, i.e. one sequence difference, for example between AAAA and AACA. This highly suggests that these "unique" UMIs are actually a result of PCR and/or sequencing errors. The authors of UMI-tools used a network approach to join UMIs that map to the same location and only differed by 1 bp to estimate the true UMI count by discarding unique UMIs that were most likely generated by errors; you can read the paper here. Anyway back to the BAM tags.

Many (if not all) of Illumina's sequencers support multiplexing, i.e. indexing, which is what the 8 nt i7 sample index is used for. This is sequenced and stored in a separate file (I1).

zcat pbmc8k_S1_L007_I1_001.fastq.gz  | grep -A3 ST-K00126:314:HFYL2BBXX:7:2103:14996:4725
@ST-K00126:314:HFYL2BBXX:7:2103:14996:4725 1:N:0:TCGTCACG
TCGTCACG
+
AAFFFJJJ

The sample index sequence and its base calling qualities are stored in the BC and QT tags, respectively.

Finally, the second read contains a partial sequence of the cDNA (98 nt long). Since the sequencing of read 2 goes from the 3' to 5' end, there is an enrichment of sequences from the 3' end, hence the protocol is called Chromium Single Cell 3' Solution. (However, I believe that the enzymatic fragmentation of the double-stranded cDNA is random, which is why there is sequencing throughout a cDNA sequence.) The sequence and base calling quality are stored in the 10th and 11th fields, respectively, of the BAM file.

zcat pbmc8k_S1_L007_R2_001.fastq.gz | grep -A3 ST-K00126:314:HFYL2BBXX:7:2103:14996:4725
@ST-K00126:314:HFYL2BBXX:7:2103:14996:4725 2:N:0:TCGTCACG
GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGCTACGG
+
A<FFAF7JAFFJ7JFFF7AFFF-FJF<JFFFFA-AFF---AJJJ<FJJJFAJJAAAJJJAFF----AA<<A-FFFF-A<<---7AF<7FJ<7)-7-)<

Final note: I probably didn't need to use "respectively" that often.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
2 comments Add yours
  1. I am looking for some script that could extract the cell-wise statistics (listed below) from the CellRanger bam file?
    Total Number of reads
    Total Number of mapped reads
    Total Number of unmapped reads
    Total Number of reads mapped to exonic regions
    Total Number of reads mapped to intronic regions
    Total Number of reads mapped to intragenic regions

    Do you have any suggestions?

    1. It has been a while since I used CellRanger, so what I suggest is based on outdated information.

      You could do something like this:


      samtools view -@ 8 40k_NSCLC_DTC_3p_HT_nextgem_Multiplex_count_unassigned_alignments.bam | head -10000 | perl -nle 'if(/RE:A:([ENI])/){ $r = $1 } else { $r = "U" }; $b = $1 if /CB:Z:([ACGT0-9-]+)/; print "$b\t$r"' | sort -k1,1 -k2,2 | uniq -c | sort -k1,1 | head
      1 AAACCCACACCTGCGA-1 I
      1 AAACCCAGTACAAGCG-1 I
      1 AAACCCAGTGTTATCG-1 I
      1 AAACCCAGTTGCATCA-1 I
      1 AAACGAAAGCACCCAC-1 I
      1 AAACGAAAGTCATCCA-1 I
      1 AAACGAAAGTGTGTTC-1 I
      1 AAACGAACACTGAGGA-1 I
      1 AAACGAATCATCTCTA-1 I
      1 AAACGCTAGAAACTAC-1 I

      The line above streams the SAMtools output to Perl, which extracts the cell barcode and mapping region (U for unmapped), sorts by cell barcode and mapping region, tallies the counts, and then sorts by cell barcode again. This could take a long time to run depending on how big your BAM file is. If you want to try it, take out the head command in the pipeline.

      If you have a list of whitelisted barcodes, you could write a script to speed this up (since we no longer have to sort) by tallying the regions for the barcodes as they occur line by line.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.