BWA and multi-mapping reads

2014 June 6th: Major rewrite of post

The Burrows-Wheeler Aligner (BWA) is a popular short read alignment program. Here I test the program with an artificial reference sequence. First download and compile the program:

wget http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.9a.tar.bz2
tar -xjf bwa-0.7.9a.tar.bz2
cd bwa-0.7.9a/
make

Let’s make up some artificial reference sequence (ref.fa):

>artificial
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG

and this read (read.fa):

>tag
ACGTACGTACGTACGTAGGTACGTA

Now let’s align the read to the reference:

#index the reference
bwa index ref.fa

#perform the alignment
bwa aln ref.fa read.fa > read.sai

#generate the alignment
bwa samse ref.fa read.sai read.fa > read.sam

#check out the alignment
cat read.sam
@SQ     SN:artificial   LN:140
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786  CL:bwa samse ref.fa read.sai read.fa
tag     0       artificial      57      0       25M     *       0       0       ACGTACGTACGTACGTAGGTACGTA       *       XT:A:R  NM:i:0  X0:i:5  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:25

The SAM tags starting with an X are BWA specific. The X0 lists the number of best hits which in this case is 5, resulting in a mapping quality score of 0; reads that map equally well to multiple places are assigned a mapping quality of 0. Check out my post on mapping quality, if you are unfamiliar with the concept. The XT tag shows that it is a repeated sequence; a uniquely mapping read will have a U.

If I change the reference (ref2.fa) to:

>artificial
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG

and perform the same steps again:

bwa index ref2.fa
bwa aln ref2.fa read.fa > read_ref2.sai
bwa samse ref2.fa read_ref2.sai read.fa > read_ref2.sam
cat read_ref2.sam
@SQ     SN:artificial   LN:560
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786  CL:bwa samse ref2.fa read_ref2.sai read.fa
tag     0       artificial      281     0       25M     *       0       0       ACGTACGTACGTACGTAGGTACGTA       *       XT:A:R  NM:i:0  X0:i:20 X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:25

Of note is that the X0 tag is now 20 and the mapping position of the read has changed from 57 to 281. Interestingly this is the middle of the reference sequence, which was also the case in the first example. I will create some more examples to see if BWA always picks the middle position out of a list of potential mapping locations.

cat ref.fa
>ref
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

cat read.fa
>read
GCCGGGCGTGGTGGCTTACGCCTGTAATCC

bwa index ref.fa
bwa aln ref.fa read.fa > read.sai
bwa samse ref.fa read.sai read.fa > read.sam
cat read.sam
@SQ     SN:ref  LN:420
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786  CL:bwa samse ref.fa read.sai read.fa
read    0       ref     181     0       30M     *       0       0       GCCGGGCGTGGTGGCTTACGCCTGTAATCC  *       XT:A:R  NM:i:0  X0:i:7  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:30

Again it chooses the middle list of alignment possibilities. Is this behaviour true if I have multiple fasta entries in the reference, such as different chromosomes?

cat ref.fa
>ref
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ref2
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
GCCGGGCGTGGTGGCTTACGCCTGTAATCC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

cat read.fa
>read
GCCGGGCGTGGTGGCTTACGCCTGTAATCC

bwa index ref.fa
bwa aln ref.fa read.fa > read.sai
bwa samse ref.fa read.sai read.fa > read.sam
cat read.sam
@SQ     SN:ref  LN:180
@SQ     SN:ref2 LN:240
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786  CL:bwa samse ref.fa read.sai read.fa
read    0       ref2    1       0       30M     *       0       0       GCCGGGCGTGGTGGCTTACGCCTGTAATCC  *       XT:A:R  NM:i:0  X0:i:7  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:30

At least with these simplistic examples, it seems that if a read is mapped equally well at multiple locations, the reported location is middle location of the list of multiple hit locations.

I had always thought that if a read mapped to multiple locations, a random location would be chosen, as stated in the manual for samse:

Repetitive hits will be randomly chosen.

Or did I do something wrong?

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
3 comments Add yours
  1. Hi Dave,

    Even if this is an old post, I had similar questions, and I used your post as a starting point. What I found, is that bwa mem randomly assign reads as it should (I used bwa 0.7.16a-r1185-dirty), but it does so in a reproducible manner: if you run 10 times the same multimapping read to the artificial reference, the primary alignment position will be always the same. However, if your read file contains 100,000 identical multimapping reads, they will be distributed more or less equally among the possible hits in the reference genome (see below for code).

    Notably, there was a bugfix in Release 0.7.6 (31 Januaray, 2014; “Bugfix: for a repetitive single-end read, the reported hit is not randomly distributed among equally best hits.”). Thus, it is possible that the unexpected behavior that you observed was fixed since your blog post.

    Anyway, great blog!
    Regards

    $ cat ref.fa
    >ref1
    GCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    >ref2
    GCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    >ref3
    GCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

    $ awk ‘BEGIN{for (i=1;iread” i,”\nGCCGGGCGTGGTGGCTTACGCCTGTAATCC”}}’ > reads.fa

    $ bwa mem ref.fa reads.fa | awk ‘{print $3?-“$4}’ | sort -k1,1 | uniq -c
    6521 ref1-1
    6659 ref1-121
    6609 ref1-61
    6675 ref2-1
    6681 ref2-121
    6701 ref2-181
    6701 ref2-61
    6595 ref3-1
    6722 ref3-121
    6759 ref3-181
    6693 ref3-241
    6726 ref3-301
    6666 ref3-361
    6548 ref3-421
    6744 ref3-61

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.