Getting started with Picard

Updated hyperlinks on the 2015 January 26th; please comment if you find any more dead links.

Picard is a suite of Java-based command-line utilities that manipulate SAM/BAM files. Currently, I'm analysing some paired-end libraries and I wanted to calculate the average insert size based on the alignments; that's how I found Picard. While reading the documentation I realised that Picard has a lot more to offer and hence I'm writing this post to look into specific tools that I thought would be useful.

To get started, download Picard and an alignment (whole genome shotgun sequencing) from the 1000 genomes project that we can use to test Picard:

Continue reading

Bowtie and multimapping reads

Updated 2014 June 8th

I first tried this with BWA. Now I'll try it with Bowtie.

git clone https://github.com/BenLangmead/bowtie.git
cd bowtie
make

Consider this reference sequence, which is the sequence "ACGTACGTACGTACGTAGGTACGTAGGG" repeated 20 times:

>artificial
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG

and this read:

>tag
ACGTACGTACGTACGTAGGTACGTA

The next steps are to build an index and then align our read to the index:

#build index
bowtie-build ref.fa ref

#if you want to scroll up and down the usage
#bowtie 2>&1 | less
#here are what the parameters mean
#-k <int>           report up to <int> good alignments per read (default: 1)
#-f                 query input files are (multi-)FASTA .fa/.mfa
#-S/--sam           write hits in SAM format
bowtie -k 40 -f -S ref read.fa
@HD	VN:1.0	SO:unsorted
@SQ	SN:artificial	LN:560
@PG	ID:Bowtie	VN:1.0.1	CL:"bowtie -k 40 -f -S ref read.fa"
tag	0	artificial	29	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	57	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	85	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	113	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	141	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	169	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	197	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	225	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	253	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	281	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	309	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	337	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	365	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	393	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	421	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	449	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	477	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	505	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	533	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	1	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	289	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	261	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	233	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	205	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	177	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	149	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	121	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	93	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	65	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	37	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	9	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	513	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	485	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	457	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	429	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	401	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	373	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	345	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
tag	0	artificial	317	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:2	MD:Z:9G9G5	NM:i:2
# reads processed: 1
# reads with at least one reported alignment: 1 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 39 alignments to 1 output stream(s)

To report all the best alignments:

bowtie -f -a -S --best --strata ref read.fa
@HD	VN:1.0	SO:unsorted
@SQ	SN:artificial	LN:560
@PG	ID:Bowtie	VN:1.0.1	CL:"bowtie -f -a -S --best --strata ref read.fa"
tag	0	artificial	505	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	477	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	449	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	421	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	393	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	365	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	337	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	309	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	281	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	253	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	225	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	197	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	169	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	141	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	113	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	85	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	57	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	29	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	1	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
tag	0	artificial	533	255	25M	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:25	NM:i:0
# reads processed: 1
# reads with at least one reported alignment: 1 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 20 alignments to 1 output stream(s)

If I want to exclude tags mapping to m places:

bowtie -f -a -m 19 -S --best --strata ref read.fa
@HD	VN:1.0	SO:unsorted
@SQ	SN:artificial	LN:560
@PG	ID:Bowtie	VN:1.0.1	CL:"bowtie -f -a -m 19 -S --best --strata ref read.fa"
tag	4	*	0	0	*	*	0	0	ACGTACGTACGTACGTAGGTACGTA	IIIIIIIIIIIIIIIIIIIIIIIII	XM:i:19
# reads processed: 1
# reads with at least one reported alignment: 0 (0.00%)
# reads that failed to align: 0 (0.00%)
# reads with alignments suppressed due to -m: 1 (100.00%)
No alignments

I could not find an option for reporting all alignments in BWA (which may exist?); the default behaviour for BWA is to report one random location for multimapping tags. Using the -a -m 10 -S --best --strata parameters, does exactly what I want; report all alignments but keep only the best hits, however if a tag maps to more than 10 places mark it as unmapped.

See the Bowtie manual for more information and usage examples.

Mapping qualities

Updated 2014 December 17th

Current high throughput sequencers produces reads that are short; for example the HiSeq2000 produces millions of reads that are 50 and 100 bp long. To align such short reads with high speed and accuracy, many short read alignment programs have been developed, such as BWA. The major limitation is the length of the sequenced reads because many eukaryotic genomes are repetitive and therefore it is difficult to accurately map these reads. Because of this, alignment programs have mapping qualities for each read that is mapped to the reference genome. A mapping quality is basically the probability that a read is aligned in the wrong place (i.e. phred-scaled posterior probability that the mapping position of this read is incorrect). The probability is calculated as:

where q is the quality. For example a mapping quality of 40 = 10 to the power of -4, which is 0.0001, which means there is a 0.01 percent chance that the read is aligned incorrectly.

Base calling errors with respect to mapping qualities

Sequencers make base calling mistakes and this complicates matters. To illustrate how this affects the mapping qualities using BWA, I will use an example I came across in SEQanswers. First let's examine mapping qualities when a read maps to a specific region without suboptimal hits:

#what version of BWA
bwa 2>&1 | head -4

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.10-r789
Contact: Heng Li <lh3@sanger.ac.uk>

cat ref.fa 
>ref
ACGTACGTACGTACGTACGTACGTAGGG
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC

cat read.fa 
>read
ACGTACGTACGTACGTACGTACGTA

Now to do the mapping:

bwa index ref.fa
bwa aln ref.fa read.fa > read.sai
bwa samse ref.fa read.sai read.fa > read.sam

#scroll to see the rest of the SAM file
cat read.sam 
@SQ	SN:ref	LN:196
@PG	ID:bwa	PN:bwa	VN:0.7.10-r789	CL:bwa samse ref.fa read.sai read.fa
read	0	ref	1	37	25M	*	0	0	ACGTACGTACGTACGTACGTACGTA	*	XT:A:U	NM:i:0	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:25

Mapping the read to our reference, BWA returns a mapping quality of 37 (which is actually the highest mapping quality BWA returns).

Next let's create an example with suboptimal hits. Below is a reference that contains five identical stretches of 28 mers and one 28 mer with a single mismatch (in red) compared to the other five:

>ref2
ACGTACGTACGTACGTACGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG

Let's map a read from the single mismatch stretch to this reference:

cat ref2.fa 
>ref2
ACGTACGTACGTACGTACGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG

cat read2.fa 
>read2
ACGTACGTACGTACGTACGTACGTA

bwa index ref2.fa
bwa aln ref2.fa read2.fa > read2.sai
bwa samse ref2.fa read2.sai read2.fa > read2.sam

#scroll to see the rest of the SAM file
cat read2.sam 
@SQ	SN:ref2	LN:168
@PG	ID:bwa	PN:bwa	VN:0.7.10-r789	CL:bwa samse ref2.fa read2.sai read2.fa
read2	0	ref2	1	16	25M	*	0	0	ACGTACGTACGTACGTACGTACGTA	*	XT:A:U	NM:i:0	X0:i:1	X1:i:5	XM:i:0	XO:i:0	XG:i:0	MD:Z:25

The mapping quality of the read in the second example is 16, which has a probability of of mapping to the wrong place. Even though the read maps uniquely in the reference, its mapping quality is 16 and not 37. The BWA specific tags in the SAM file provides some nice additional information:

XT Type: Unique/Repeat/N/Mate-sw
X0 Number of best hits
X1 Number of suboptimal hits found by BWA

From the BWA tag information we can quickly deduce whether a read is aligned uniquely; in this case the XT:A:U indicates that it was aligned uniquely. In addition, the X1:i:5 tag indicates that there were 5 suboptimal hits.

Mapping qualities when considering base calling errors

To model base calling errors we can use the Binomial distribution; if I expect there to be 1 base calling error in 100 bps, I can calculate the probability of an error for a read of 25 nt as such using R

#Probability of success (1 error in 100 bases) = 0.99
#Number of trials (each base is a trial) = 25

#no base calling errors, i.e. 25 successes
dbinom(x = 25, size=25, prob=0.99) 
#[1] 0.7778214

#one base calling error, i.e. 24 successes
dbinom(x = 24, size=25, prob=0.99) 
#[1] 0.1964195

#two base calling errors, i.e. 23 successes
dbinom(x = 23, size=25, prob=0.99) 
#[1] 0.02380843

If we expect 1 base calling error in 100 bps, the probability of making two base calling errors in 25 bps is quite low. Using the formula from the SEQanswers post that calculates the posterior probability that the best alignment is actually correct:

Calculating this in R:

#the posterior probability that the best alignment is correct
p = 0.99
dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p)))
#[1] 0.4419643

In reality base calling is much more accurate than 1 error in 100 bases, which is a Phred quality score of 20. If we changed the base calling error rate to 1 in 1000 (Phred score of 30):

p = 0.999
dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p)))
#[1] 0.88879

then the posterior probability that the best alignment is correct improves to 0.88879. Using a base calling error rate of 1 in 10000 (Phred score of 40):

p = 0.9999
dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p)))
#[1] 0.9876531

improves the probability to 0.9876531, which is a ~0.012 probability that the alignment is incorrect, which is around the same ball park to the BWA mapping quality of 16, which is a 0.025 probability that the alignment is incorrect.

Does BWA make use of base calling qualities?

When I included base calling qualities to the read

@read
ACGTACGTACGTACGTACGTACGTA
+
!!!!!!!!!!!!!!!!!!!!!!!!!

I still get the same mapping quality of 16 with BWA, indicating that mapping qualities are not used by BWA:

tag 0 artificial 1 16 25M * 0 0 ACGTACGTACGTACGTACGTACGTA !!!!!!!!!!!!!!!!!!!!!!!!! XT:A:U NM:i:0 X0:i:1 X1:i:5 XM:i:0 XO:i:0 XG:i:0 MD:Z:25

This was confirmed when I examined the BWA manual, which mentioned that "Base quality is NOT considered in evaluating hits."

Perl and SAM

Lincoln Stein has written a bunch of modules to deal with SAM/BAM files. Check out the CPAN module. If you are having trouble installing Bio::DB::Sam, you may have to recompile SAMTools with the following command:

make CXXFLAGS=-fPIC CFLAGS=-fPIC CPPFLAGS=-fPIC

To install the Perl module on a machine where you don't have root access, follow these instructions.

Using this module, I recreated the alignment by parsing the CIGAR string and then went through the alignment to collect the edit statistics. The CIGAR line indicates the number of Matches/Mismatches, Insertions and Deletions in each alignment. The MD tag gives a better resolution of the nucleotides involved but using the module the reference sequence is generated for you already.

Some examples of how the CIGAR string and the MD tag annotates alignments:

No insertions or deletions:
Positive strand match CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG CIGAR: 36M MD:Z:1A0C0C0C1T0C0T27
Reference genome = CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG

CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG 
 ---- ---
CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG

Insertion:
Positive strand match GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT CIGAR: 6M1I29M MD:Z:0C1C0C1C0T0C27
Reference genome = CACCCCTCTGACATCCGGCCTGCTCCTTCTCACAT

GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT
- -- - --
CACCCC-TCTGACATCCGGCCTGCTCCTTCTCACAT

Deletion:
Positive strand match AGTGATGGGGGGGTTCCAGGTGGAGACGAGGACTCC CIGAR: 9M9D27M MD:Z:2G0A5^ATGATGTCA27
Reference genome = AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC

AGTGATGGG---------GGGGTTCCAGGTGGAGACGAGGACTCC
  --     ---------
AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC

And all three together:
Positive strand match AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA CIGAR: 2M1I7M6D26M MD:Z:3C3T1^GCTCAG26
Reference genome = AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA

AGTGATGGGA------GGATGTCTCGTCTGTGAGTTACAGCA
    -   -
AG-GCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA

And below is my ugly code to output the frequency of mismatches/deletions/insertions of each base position in a tag alignment.

#!/usr/bin/perl

use strict;
use warnings;

use Bio::DB::Sam;
#use Data::Dumper;

my $usage = "Usage $0 <infile.bam>\n";
my $infile = shift or die $usage;

my $sam = Bio::DB::Sam->new(-bam  => $infile);

my %error_profile = ();

#store all the targets which should be all assembled chromosomes
my @targets = $sam->seq_ids;

@targets = 'chrM'; #comment out this line if you are game to test it on all alignments
my $total_tag = '0';

#iterate through each target individually, since each target is stored in memory
foreach my $target (@targets){
   warn "$target\n";
   my @alignments = $sam->get_features_by_location(-seq_id => $target);

   for my $a (@alignments) {
      ++$total_tag;

      #tag id
      my $id  = $a->name;

      #alignment information in the reference sequence
      #my $seqid  = $a->seq_id;
      #my $start  = $a->start;
      #my $end    = $a->end;
      #strand is stored as -1 or 1
      my $strand = $a->strand;
      $strand = '+' if $strand =~ /^1$/;
      $strand = '-' if $strand =~ /^-1$/;
      my $cigar  = $a->cigar_str;
      my $md_tag = $a->get_tag_values('MD');
      my $nm_tag = $a->get_tag_values('NM');

      die "$id MD tag not defined on line $." if ($md_tag =~ /^$/);
      die "$id NM tag not defined on line $." if ($nm_tag =~ /^$/);

      #alignment information in the query sequence
      #my $query_start = $a->query->start;
      #my $query_end   = $a->query->end;
      my $ref_dna   = $a->dna;        # reference sequence bases
      my $query_dna = $a->query->dna; # query sequence bases
      my @scores    = $a->qscore;     # per-base quality scores
      my $match_qual= $a->qual;       # quality of the match

      #store reference and query into separate variable for manipulation later
      my $reference = $ref_dna;
      my $query = $query_dna;

      #from the CIGAR string, fill in the insertions and deletions
      my $position = '0';
      while ($cigar !~ /^$/){
         if ($cigar =~ /^([0-9]+[MIDS])/){
            my $cigar_part = $1;
            if ($cigar_part =~ /(\d+)M/){
               $position += $1;
            } elsif ($cigar_part =~ /(\d+)I/){
               my $insertion = '-' x $1;
               substr($reference,$position,0,$insertion);
               $position += $1;
            } elsif ($cigar_part =~ /(\d+)D/){
               my $insertion = '-' x $1;
               substr($query,$position,0,$insertion);
               $position += $1;
            } elsif ($cigar_part =~ /(\d+)S/){
               die "Not ready for this!\n";
               #my $insertion = 'x' x $1;
               #substr($new_ref,$position,0,$insertion);
               #$position += $1;
            }
            $cigar =~ s/$cigar_part//;
         } else {
            die "Unexpected cigar: $id $cigar\n";
         }
      }

      #in the bam files I process, the original tags are reverse complemented
      #re-reverse complement to see the original tag
      if ($strand eq '-'){
         $query = rev_com($query);
         $reference = rev_com($reference);
      }

      my $ed = '0';
      for (my $i =0; $i < length($reference); ++$i){
         my $q = substr($query,$i,1);
         my $r = substr($reference,$i,1);
         if ($q ne $r){
            ++$ed;
            #annotated as a deletion into the reference sequence
            if ($q eq '-'){
               if (exists $error_profile{$i}{'deletion'}){
                  $error_profile{$i}{'deletion'}++;
               } else {
                  $error_profile{$i}{'deletion'} = '1';
               }
            }
            #annotated as an insertion into the reference sequence
            elsif ($r eq '-'){
               if (exists $error_profile{$i}{'insertion'}){
                  $error_profile{$i}{'insertion'}++;
               } else {
                  $error_profile{$i}{'insertion'} = '1';
               }
            }
            #mismatch
            else {
               if ($q eq 'A'){
                  if (exists $error_profile{$i}{$q}){
                     $error_profile{$i}{$q}++;
                  } else {
                     $error_profile{$i}{$q} = '1';
                  }
               }
               elsif ($q eq 'C'){
                  if (exists $error_profile{$i}{$q}){
                     $error_profile{$i}{$q}++;
                  } else {
                     $error_profile{$i}{$q} = '1';
                  }
               }
               elsif ($q eq 'G'){
                  if (exists $error_profile{$i}{$q}){
                     $error_profile{$i}{$q}++;
                  } else {
                     $error_profile{$i}{$q} = '1';
                  }
               }
               elsif ($q eq 'T'){
                  if (exists $error_profile{$i}{$q}){
                     $error_profile{$i}{$q}++;
                  } else {
                     $error_profile{$i}{$q} = '1';
                  }
               }
               elsif ($q eq 'N'){
                  if (exists $error_profile{$i}{$q}){
                     $error_profile{$i}{$q}++;
                  } else {
                     $error_profile{$i}{$q} = '1';
                  }
               }
               else {
                  die "That was unexpected q = $q que = $query ref = $reference\n";
               }
            }
         }
      }
      #double check
      die "Error in calculating edit distance nm: $nm_tag ed: $ed\n" if $ed ne $nm_tag;

      #print "$id $strand $match_qual $nm_tag $md_tag\n";
      #print "Que: $query\nRef: $reference\n\n";

   }
}

foreach my $base (sort {$a <=> $b} keys %error_profile){
   print "$base\n";
   foreach my $type (keys %{$error_profile{$base}}){
      print "\t$type\t$error_profile{$base}{$type}\n";
   }
}

print "Total tag = $total_tag\n";

exit(0);

sub rev_com {
   my ($seq) = @_;
   $seq = reverse($seq);
   $seq =~ tr/ACGT/TGCA/;
   return($seq);
}

__END__

There are definitely much better ways of doing this. Actually, there is a tool called SAMStat that does this already (and more) if you open up the html file with the results, so use that instead.