2014 June 6th: Major rewrite of post
The Burrows-Wheeler Aligner (BWA) is a popular short read alignment program. Here I test the program with an artificial reference sequence. First download and compile the program:
wget http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.9a.tar.bz2 tar -xjf bwa-0.7.9a.tar.bz2 cd bwa-0.7.9a/ make
Let's make up some artificial reference sequence (ref.fa):
>artificial
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
and this read (read.fa):
>tag
ACGTACGTACGTACGTAGGTACGTA
Now let's align the read to the reference:
#index the reference bwa index ref.fa #perform the alignment bwa aln ref.fa read.fa > read.sai #generate the alignment bwa samse ref.fa read.sai read.fa > read.sam #check out the alignment cat read.sam @SQ SN:artificial LN:140 @PG ID:bwa PN:bwa VN:0.7.9a-r786 CL:bwa samse ref.fa read.sai read.fa tag 0 artificial 57 0 25M * 0 0 ACGTACGTACGTACGTAGGTACGTA * XT:A:R NM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:25
The SAM tags starting with an X are BWA specific. The X0 lists the number of best hits which in this case is 5, resulting in a mapping quality score of 0; reads that map equally well to multiple places are assigned a mapping quality of 0. Check out my post on mapping quality, if you are unfamiliar with the concept. The XT tag shows that it is a repeated sequence; a uniquely mapping read will have a U.
If I change the reference (ref2.fa) to:
>artificial
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
and perform the same steps again:
bwa index ref2.fa bwa aln ref2.fa read.fa > read_ref2.sai bwa samse ref2.fa read_ref2.sai read.fa > read_ref2.sam cat read_ref2.sam @SQ SN:artificial LN:560 @PG ID:bwa PN:bwa VN:0.7.9a-r786 CL:bwa samse ref2.fa read_ref2.sai read.fa tag 0 artificial 281 0 25M * 0 0 ACGTACGTACGTACGTAGGTACGTA * XT:A:R NM:i:0 X0:i:20 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:25
Of note is that the X0 tag is now 20 and the mapping position of the read has changed from 57 to 281. Interestingly this is the middle of the reference sequence, which was also the case in the first example. I will create some more examples to see if BWA always picks the middle position out of a list of potential mapping locations.
cat ref.fa >ref GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA cat read.fa >read GCCGGGCGTGGTGGCTTACGCCTGTAATCC bwa index ref.fa bwa aln ref.fa read.fa > read.sai bwa samse ref.fa read.sai read.fa > read.sam cat read.sam @SQ SN:ref LN:420 @PG ID:bwa PN:bwa VN:0.7.9a-r786 CL:bwa samse ref.fa read.sai read.fa read 0 ref 181 0 30M * 0 0 GCCGGGCGTGGTGGCTTACGCCTGTAATCC * XT:A:R NM:i:0 X0:i:7 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:30
Again it chooses the middle list of alignment possibilities. Is this behaviour true if I have multiple fasta entries in the reference, such as different chromosomes?
cat ref.fa >ref GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA >ref2 GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GCCGGGCGTGGTGGCTTACGCCTGTAATCC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA cat read.fa >read GCCGGGCGTGGTGGCTTACGCCTGTAATCC bwa index ref.fa bwa aln ref.fa read.fa > read.sai bwa samse ref.fa read.sai read.fa > read.sam cat read.sam @SQ SN:ref LN:180 @SQ SN:ref2 LN:240 @PG ID:bwa PN:bwa VN:0.7.9a-r786 CL:bwa samse ref.fa read.sai read.fa read 0 ref2 1 0 30M * 0 0 GCCGGGCGTGGTGGCTTACGCCTGTAATCC * XT:A:R NM:i:0 X0:i:7 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:30
At least with these simplistic examples, it seems that if a read is mapped equally well at multiple locations, the reported location is middle location of the list of multiple hit locations.
I had always thought that if a read mapped to multiple locations, a random location would be chosen, as stated in the manual for samse:
Repetitive hits will be randomly chosen.
Or did I do something wrong?

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi Dave,
Even if this is an old post, I had similar questions, and I used your post as a starting point. What I found, is that bwa mem randomly assign reads as it should (I used bwa 0.7.16a-r1185-dirty), but it does so in a reproducible manner: if you run 10 times the same multimapping read to the artificial reference, the primary alignment position will be always the same. However, if your read file contains 100,000 identical multimapping reads, they will be distributed more or less equally among the possible hits in the reference genome (see below for code).
Notably, there was a bugfix in Release 0.7.6 (31 Januaray, 2014; “Bugfix: for a repetitive single-end read, the reported hit is not randomly distributed among equally best hits.”). Thus, it is possible that the unexpected behavior that you observed was fixed since your blog post.
Anyway, great blog!
Regards
$ cat ref.fa
>ref1
GCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ref2
GCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ref3
GCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCGGGCGTGGTGGCTTACGCCTGTAATCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
$ awk ‘BEGIN{for (i=1;iread” i,”\nGCCGGGCGTGGTGGCTTACGCCTGTAATCC”}}’ > reads.fa
$ bwa mem ref.fa reads.fa | awk ‘{print $3?-“$4}’ | sort -k1,1 | uniq -c
6521 ref1-1
6659 ref1-121
6609 ref1-61
6675 ref2-1
6681 ref2-121
6701 ref2-181
6701 ref2-61
6595 ref3-1
6722 ref3-121
6759 ref3-181
6693 ref3-241
6726 ref3-301
6666 ref3-361
6548 ref3-421
6744 ref3-61
Thanks for the comment and good to know they have addressed this.