Mapping qualities - Dave Tang's blog

Last updated: 2023/06/06

Current high throughput sequencers produces reads that are short; for example the HiSeq2000 produces millions of reads that are 50 and 100 bp long. To align such short reads with high speed and accuracy, many short read alignment programs have been developed, such as BWA. The major limitation is the length of the sequenced reads because many eukaryotic genomes are repetitive and therefore it is difficult to accurately map these reads. Because of this, alignment programs have mapping qualities for each read that is mapped to the reference genome. A mapping quality is basically the probability that a read is aligned in the wrong place (i.e. phred-scaled posterior probability that the mapping position of this read is incorrect). The probability is calculated as:

$latex p = 10^\frac{-q}{10} &s=2$

where $latex q$ is the quality. For example a mapping quality of 40 = 10 to the power of -4, which is 0.0001, which means there is a 0.01 percent chance that the read is aligned incorrectly.

$latex p = 10^\frac{-40}{10} = 0.0001 &s=2$

Base calling errors with respect to mapping qualities

Sequencers make base calling mistakes and this complicates matters. To illustrate how this affects the mapping qualities using BWA, I will use an example I came across in SEQanswers.

bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1198-dirty
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

First let's examine mapping qualities when a read maps to a specific region without sub-optimal hits.

Here's the reference

cat ref.fa

>ref
ACGTACGTACGTACGTACGTACGTAGGG
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC

Here's the read.

cat read.fa

>read
ACGTACGTACGTACGTACGTACGTA

Now to do the mapping:

bwa index ref.fa
bwa aln ref.fa read.fa > read.sai
bwa samse ref.fa read.sai read.fa > read.sam

cat read.sam

@SQ     SN:ref  LN:196
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1198-dirty   CL:bwa samse ref.fa read.sai read.fa
read    0       ref     1       37      25M     *       0       0       ACGTACGTACGTACGTACGTACGTA       *       XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:25

Mapping the read to our reference, BWA returns a mapping quality of 37 (which is actually the highest mapping quality BWA returns).

Next let's create an example with sub-optimal hits. Below is a reference that contains five identical stretches of 28 mers and one 28 mer with a single mismatch (in red) compared to the other five:

ref2
ACGTACGTACGTACGTACGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG

Let's map a read from the single mismatch stretch to this reference:

cat ref2.fa

>ref2
ACGTACGTACGTACGTACGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG

cat read2.fa

>read2
ACGTACGTACGTACGTACGTACGTA

bwa index ref2.fa
bwa aln ref2.fa read2.fa > read2.sai
bwa samse ref2.fa read2.sai read2.fa > read2.sam
cat read2.sam

@SQ     SN:ref2 LN:168
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1198-dirty   CL:bwa samse ref2.fa read2.sai read2.fa
read2   0       ref2    1       16      25M     *       0       0       ACGTACGTACGTACGTACGTACGTA       *       XT:A:U  NM:i:0  X0:i:1  X1:i:5  XM:i:0  XO:i:0  XG:i:0  MD:Z:25

The mapping quality of the read in the second example is 16, which has a probability of $latex 10^{-16/10} = 0.025119 $ of mapping to the wrong place. Even though the read maps uniquely in the reference, its mapping quality is 16 and not 37. The BWA specific tags in the SAM file provides some nice additional information:

XT - Type: Unique/Repeat/N/Mate-sw
X0 - Number of best hits
X1 - Number of suboptimal hits found by BWA

From the BWA tag information we can quickly deduce whether a read is aligned uniquely; in this case the XT:A:U indicates that it was aligned uniquely. In addition, the X1:i:5 tag indicates that there were 5 suboptimal hits.

Mapping qualities when considering base calling errors

To model base calling errors we can use the Binomial distribution; if we expect there to be 1 base calling error in 100 bps, we can calculate the probability of an error for a read of 25 nt as such using R.

# Probability of success (1 error in 100 bases) = 0.99
# Number of trials (each base is a trial) = 25

# no base calling errors, i.e. 25 successes
dbinom(x = 25, size=25, prob=0.99)
# [1] 0.7778214

# one base calling error, i.e. 24 successes
dbinom(x = 24, size=25, prob=0.99)
# [1] 0.1964195

# two base calling errors, i.e. 23 successes
dbinom(x = 23, size=25, prob=0.99)
# [1] 0.02380843

If we expect 1 base calling error in 100 bps, the probability of making two base calling errors in 25 bps is quite low. Using the formula from the SEQanswers post that calculates the posterior probability that the best alignment is actually correct:

$latex \frac{P_{zero\ base\ errors}}{P_{zero\ bases\ errors} + 5 \times P_{one\ base\ error}} &s=2$

Calculating this in R.

# the posterior probability that the best alignment is correct
p = 0.99
dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p)))
# [1] 0.4419643

In reality base calling is much more accurate than 1 error in 100 bases, which is a Phred quality score of 20. If we changed the base calling error rate to 1 in 1000 (Phred score of 30):

p = 0.999
dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p)))
# [1] 0.88879

then the posterior probability that the best alignment is correct improves to 0.88879. Using a base calling error rate of 1 in 10000 (Phred score of 40):

p = 0.9999
dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p)))
# [1] 0.9876531

improves the probability to 0.9876531, which is a ~0.012 probability that the alignment is incorrect, which is around the same ball park to the BWA mapping quality of 16, which is a 0.025 probability that the alignment is incorrect.

Does BWA make use of base calling qualities?

When I included base calling qualities to the read

cat read3.fq

@read3
ACGTACGTACGTACGTACGTACGTA
+
!!!!!!!!!!!!!!!!!!!!!!!!!

I still get the same mapping quality of 16 with BWA, indicating that mapping qualities are not used by BWA.

bwa aln ref2.fa read3.fq > read3.sai
bwa samse ref2.fa read3.sai read3.fq > read3.sam
cat read3.sam

@SQ     SN:ref2 LN:168
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1198-dirty   CL:bwa samse ref2.fa read3.sai read3.fq
read3   0       ref2    1       16      25M     *       0       0       ACGTACGTACGTACGTACGTACGTA       !!!!!!!!!!!!!!!!!!!!!!!!!       XT:A:U  NM:i:0  X0:i:1  X1:i:5  XM:i:0  XO:i:0  XG:i:0  MD:Z:25

This was confirmed when I examined the BWA manual, which mentioned that "Base quality is NOT considered in evaluating hits."

BWA MEM

Let's create similar examples for BWA MEM.

cat ref3.fa

>ref3
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AGTCACGTAGCTACGATCGATCGATCGATCGATCGACTGATCGATCGATCGTGGGTACGATCGATGCTAGCTAGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

75 bp long read.

cat read4.fa

>read4
AGTCACGTAGCTACGATCGATCGATCGATCGATCGACTGATCGATCGATCGTGGGTACGATCGATGCTAGCTAGC

Index and map.

bwa index ref3.fa
bwa mem ref3.fa read4.fa > read4.sam
cat read4.sam

@SQ     SN:ref3 LN:450
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1198-dirty   CL:bwa mem ref3.fa read4.fa
read4   0       ref3    76      60      75M     *       0       0       AGTCACGTAGCTACGATCGATCGATCGATCGATCGACTGATCGATCGATCGTGGGTACGATCGATGCTAGCTAGC     *       NM:i:0  MD:Z:75 AS:i:75 XS:i:0

The mapping quality for a unique read is 60.

The SAM tags are:

NM (integer) - Edit distance to the reference
MD (string) - String encoding mismatched and deleted reference bases
AS (integer) - Alignment score generated by aligner
XS (integer) - Score of the 2nd best alignment

Example of read mapping to two places.

cat ref4.fa

>ref4
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AGTCACGTAGCTACGATCGATCGATCGATCGATCGACTGATCGATCGATCGTGGGTACGATCGATGCTAGCTAGC
AGTCACGTAGCTACGATCGATCGATCGATCGATCGACTGATCGATCGATCGTGGGTACGATCGATGCTAGCTAGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Index and map.

bwa index ref4.fa
bwa mem ref4.fa read4.fa > read4_mm.sam
cat read4_mm.sam

@SQ     SN:ref4 LN:450
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1198-dirty   CL:bwa mem ref4.fa read4.fa
read4   0       ref4    151     0       75M     *       0       0       AGTCACGTAGCTACGATCGATCGATCGATCGATCGACTGATCGATCGATCGTGGGTACGATCGATGCTAGCTAGC     *       NM:i:0  MD:Z:75 AS:i:75 XS:i:75 XA:Z:ref4,+76,75M,0;

The mapping quality is now 0. In addition the score of the second best alignment (XS) is the same as the alignment score for the primary alignment (AS). In addition, we now have a XA tag, which shows the other loci that the read maps to.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

5 comments Add yours

Saad says:

June 6, 2013 at 18:32

I recently emailed Heng li author of bwa. Since in the new bwa another program called bwa-mem is introduced and is recommended by the author of the software to use with longer reads >75 bp.

Bwa-mem does not seem to produce XT,XM and XO tags I was wondering what would would be the efficient way to get the optimal number of uniquely mapping reads. According to Heng li it is to use mapping quality but I have seen posts at seqanswers and biostars who say that using mapping quality “q1” through samtools would not give correct results. So my question is what value of MAPQ should be used to get the optimal number of uniquely mapping reads.

regards
Saad Khan

1. Davo says:
  
  June 7, 2013 at 00:03
  
  Hi Saad,
  
  Thanks for the comment and for the information; I haven’t tried BWA-MEM yet.
  
  I don’t know about BWA-MEM, but I know some programs that use a mapping quality of 3 to indicate a read mapping to two places. This is because 10^(-3/10) = 0.5011872. A mapping quality of 1 indicates that a read maps to five places. If that is the case, then a mapping quality of 10 gives a 90% probability that the read is mapped correctly. I don’t know what an “optimal number” is, but q10 seems to be “reasonable”.
  
  Hope that helps,
  
  Dave
  
2. Juan Montenegro says:
  
  August 13, 2013 at 00:40
  
  Hi Saad,
  I have tried BWA-MEM and found the same error. However, I think I found a way around it. At the end of each sam line you will find two tags: AS and XS. This two tags will report the current alignment score and the secondary alignment score respectively. If both scores are the same, then you have a read with multiple optimal alignments (MOA).
  A simple perl or python script can filter those out. Beware that this approach will remove any read with MOA, even if its mate is uniquely mapped.
  
  Regards,
  
  Juan Montenegro
  
  1. png says:
    
    May 5, 2023 at 05:33
    
    Thanks for explaining AS and XS. I couldn’t find it anywhere in the bwa-mem documentation.
    
Brian says:

August 27, 2020 at 10:01

Thanks Dave,
Even six years later, this post was very useful to me.
Regards, Brian

Base calling errors with respect to mapping qualities

Mapping qualities when considering base calling errors

Does BWA make use of base calling qualities?

BWA MEM

Leave a Reply Cancel reply