Updated 2014 December 17th

Current high throughput sequencers produces reads that are short; for example the HiSeq2000 produces millions of reads that are 50 and 100 bp long. To align such short reads with high speed and accuracy, many short read alignment programs have been developed, such as BWA. The major limitation is the length of the sequenced reads because many eukaryotic genomes are repetitive and therefore it is difficult to accurately map these reads. Because of this, alignment programs have mapping qualities for each read that is mapped to the reference genome. A mapping quality is basically the probability that a read is aligned in the wrong place (i.e. phred-scaled posterior probability that the mapping position of this read is incorrect). The probability is calculated as:

$$!p = 10^{-q/10}$$

where q is the quality. For example a mapping quality of 40 = 10 to the power of -4, which is 0.0001, which means there is a 0.01 percent chance that the read is aligned incorrectly.

$$!p = 10^{-40/10} = 0.0001$$

### Base calling errors with respect to mapping qualities

Sequencers make base calling mistakes and this complicates matters. To illustrate how this affects the mapping qualities using BWA, I will use an example I came across in SEQanswers. First let’s examine mapping qualities when a read maps to a specific region without suboptimal hits:

#what version of BWA bwa 2>&1 | head -4 Program: bwa (alignment via Burrows-Wheeler transformation) Version: 0.7.10-r789 Contact: Heng Li <lh3@sanger.ac.uk> cat ref.fa >ref ACGTACGTACGTACGTACGTACGTAGGG CCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCC cat read.fa >read ACGTACGTACGTACGTACGTACGTA

Now to do the mapping:

bwa index ref.fa bwa aln ref.fa read.fa > read.sai bwa samse ref.fa read.sai read.fa > read.sam #scroll to see the rest of the SAM file cat read.sam @SQ SN:ref LN:196 @PG ID:bwa PN:bwa VN:0.7.10-r789 CL:bwa samse ref.fa read.sai read.fa read 0 ref 1 37 25M * 0 0 ACGTACGTACGTACGTACGTACGTA * XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:25

Mapping the read to our reference, BWA returns a mapping quality of 37 (which is actually the highest mapping quality BWA returns).

Next let’s create an example with suboptimal hits. Below is a reference that contains five identical stretches of 28 mers and one 28 mer with a single mismatch (in red) compared to the other five:

>ref2

ACGTACGTACGTACGTACGTACGTAGGG

ACGTACGTACGTACGTAGGTACGTAGGG

ACGTACGTACGTACGTAGGTACGTAGGG

ACGTACGTACGTACGTAGGTACGTAGGG

ACGTACGTACGTACGTAGGTACGTAGGG

ACGTACGTACGTACGTAGGTACGTAGGG

Let’s map a read from the single mismatch stretch to this reference:

cat ref2.fa >ref2 ACGTACGTACGTACGTACGTACGTAGGG ACGTACGTACGTACGTAGGTACGTAGGG ACGTACGTACGTACGTAGGTACGTAGGG ACGTACGTACGTACGTAGGTACGTAGGG ACGTACGTACGTACGTAGGTACGTAGGG ACGTACGTACGTACGTAGGTACGTAGGG cat read2.fa >read2 ACGTACGTACGTACGTACGTACGTA bwa index ref2.fa bwa aln ref2.fa read2.fa > read2.sai bwa samse ref2.fa read2.sai read2.fa > read2.sam #scroll to see the rest of the SAM file cat read2.sam @SQ SN:ref2 LN:168 @PG ID:bwa PN:bwa VN:0.7.10-r789 CL:bwa samse ref2.fa read2.sai read2.fa read2 0 ref2 1 16 25M * 0 0 ACGTACGTACGTACGTACGTACGTA * XT:A:U NM:i:0 X0:i:1 X1:i:5 XM:i:0 XO:i:0 XG:i:0 MD:Z:25

The mapping quality of the read in the second example is 16, which has a probability of $$ 10^{-16/10} = 0.025119 $$ of mapping to the wrong place. Even though the read maps uniquely in the reference, its mapping quality is 16 and not 37. The BWA specific tags in the SAM file provides some nice additional information:

**XT** Type: Unique/Repeat/N/Mate-sw

**X0** Number of best hits

**X1** Number of suboptimal hits found by BWA

From the BWA tag information we can quickly deduce whether a read is aligned uniquely; in this case the XT:A:U indicates that it was aligned uniquely. In addition, the X1:i:5 tag indicates that there were 5 suboptimal hits.

### Mapping qualities when considering base calling errors

To model base calling errors we can use the Binomial distribution; if I expect there to be 1 base calling error in 100 bps, I can calculate the probability of an error for a read of 25 nt as such using R

#Probability of success (1 error in 100 bases) = 0.99 #Number of trials (each base is a trial) = 25 #no base calling errors, i.e. 25 successes dbinom(x = 25, size=25, prob=0.99) #[1] 0.7778214 #one base calling error, i.e. 24 successes dbinom(x = 24, size=25, prob=0.99) #[1] 0.1964195 #two base calling errors, i.e. 23 successes dbinom(x = 23, size=25, prob=0.99) #[1] 0.02380843

If we expect 1 base calling error in 100 bps, the probability of making two base calling errors in 25 bps is quite low. Using the formula from the SEQanswers post that calculates the posterior probability that the best alignment is actually correct:

$$! \frac{P_{zero\ base\ errors}}{P_{zero\ bases\ errors} + 5 \times P_{one\ base\ error}} $$

Calculating this in R:

#the posterior probability that the best alignment is correct p = 0.99 dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p))) #[1] 0.4419643

In reality base calling is much more accurate than 1 error in 100 bases, which is a Phred quality score of 20. If we changed the base calling error rate to 1 in 1000 (Phred score of 30):

p = 0.999 dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p))) #[1] 0.88879

then the posterior probability that the best alignment is correct improves to 0.88879. Using a base calling error rate of 1 in 10000 (Phred score of 40):

p = 0.9999 dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p))) #[1] 0.9876531

improves the probability to 0.9876531, which is a ~0.012 probability that the alignment is incorrect, which is around the same ball park to the BWA mapping quality of 16, which is a 0.025 probability that the alignment is incorrect.

### Does BWA make use of base calling qualities?

When I included base calling qualities to the read

@read

ACGTACGTACGTACGTACGTACGTA

+

!!!!!!!!!!!!!!!!!!!!!!!!!

I still get the same mapping quality of 16 with BWA, indicating that mapping qualities are not used by BWA:

tag 0 artificial 1 16 25M * 0 0 ACGTACGTACGTACGTACGTACGTA !!!!!!!!!!!!!!!!!!!!!!!!! XT:A:U NM:i:0 X0:i:1 X1:i:5 XM:i:0 XO:i:0 XG:i:0 MD:Z:25

This was confirmed when I examined the BWA manual, which mentioned that “Base quality is NOT considered in evaluating hits.”

This work is licensed under a Creative Commons

Attribution 4.0 International License.

I recently emailed Heng li author of bwa. Since in the new bwa another program called bwa-mem is introduced and is recommended by the author of the software to use with longer reads >75 bp.

Bwa-mem does not seem to produce XT,XM and XO tags I was wondering what would would be the efficient way to get the optimal number of uniquely mapping reads. According to Heng li it is to use mapping quality but I have seen posts at seqanswers and biostars who say that using mapping quality “q1” through samtools would not give correct results. So my question is what value of MAPQ should be used to get the optimal number of uniquely mapping reads.

regards

Saad Khan

Hi Saad,

Thanks for the comment and for the information; I haven’t tried BWA-MEM yet.

I don’t know about BWA-MEM, but I know some programs that use a mapping quality of 3 to indicate a read mapping to two places. This is because 10^(-3/10) = 0.5011872. A mapping quality of 1 indicates that a read maps to five places. If that is the case, then a mapping quality of 10 gives a 90% probability that the read is mapped correctly. I don’t know what an “optimal number” is, but q10 seems to be “reasonable”.

Hope that helps,

Dave

Hi Saad,

I have tried BWA-MEM and found the same error. However, I think I found a way around it. At the end of each sam line you will find two tags: AS and XS. This two tags will report the current alignment score and the secondary alignment score respectively. If both scores are the same, then you have a read with multiple optimal alignments (MOA).

A simple perl or python script can filter those out. Beware that this approach will remove any read with MOA, even if its mate is uniquely mapped.

Regards,

Juan Montenegro