Lincoln Stein has written a bunch of modules to deal with SAM/BAM files. Check out the CPAN module. If you are having trouble installing Bio::DB::Sam, you may have to recompile SAMTools with the following command:
make CXXFLAGS=-fPIC CFLAGS=-fPIC CPPFLAGS=-fPIC
To install the Perl module on a machine where you don't have root access, follow these instructions.
Using this module, I recreated the alignment by parsing the CIGAR string and then went through the alignment to collect the edit statistics. The CIGAR line indicates the number of Matches/Mismatches, Insertions and Deletions in each alignment. The MD tag gives a better resolution of the nucleotides involved but using the module the reference sequence is generated for you already.
Some examples of how the CIGAR string and the MD tag annotates alignments:
No insertions or deletions:
Positive strand match CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG CIGAR: 36M MD:Z:1A0C0C0C1T0C0T27
Reference genome = CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG
CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG ---- --- CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG
Insertion:
Positive strand match GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT CIGAR: 6M1I29M MD:Z:0C1C0C1C0T0C27
Reference genome = CACCCCTCTGACATCCGGCCTGCTCCTTCTCACAT
GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT - -- - -- CACCCC-TCTGACATCCGGCCTGCTCCTTCTCACAT
Deletion:
Positive strand match AGTGATGGGGGGGTTCCAGGTGGAGACGAGGACTCC CIGAR: 9M9D27M MD:Z:2G0A5^ATGATGTCA27
Reference genome = AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC
AGTGATGGG---------GGGGTTCCAGGTGGAGACGAGGACTCC -- --------- AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC
And all three together:
Positive strand match AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA CIGAR: 2M1I7M6D26M MD:Z:3C3T1^GCTCAG26
Reference genome = AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA
AGTGATGGGA------GGATGTCTCGTCTGTGAGTTACAGCA
- -
AG-GCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA
And below is my ugly code to output the frequency of mismatches/deletions/insertions of each base position in a tag alignment.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::Sam;
#use Data::Dumper;
my $usage = "Usage $0 <infile.bam>\n";
my $infile = shift or die $usage;
my $sam = Bio::DB::Sam->new(-bam => $infile);
my %error_profile = ();
#store all the targets which should be all assembled chromosomes
my @targets = $sam->seq_ids;
@targets = 'chrM'; #comment out this line if you are game to test it on all alignments
my $total_tag = '0';
#iterate through each target individually, since each target is stored in memory
foreach my $target (@targets){
warn "$target\n";
my @alignments = $sam->get_features_by_location(-seq_id => $target);
for my $a (@alignments) {
++$total_tag;
#tag id
my $id = $a->name;
#alignment information in the reference sequence
#my $seqid = $a->seq_id;
#my $start = $a->start;
#my $end = $a->end;
#strand is stored as -1 or 1
my $strand = $a->strand;
$strand = '+' if $strand =~ /^1$/;
$strand = '-' if $strand =~ /^-1$/;
my $cigar = $a->cigar_str;
my $md_tag = $a->get_tag_values('MD');
my $nm_tag = $a->get_tag_values('NM');
die "$id MD tag not defined on line $." if ($md_tag =~ /^$/);
die "$id NM tag not defined on line $." if ($nm_tag =~ /^$/);
#alignment information in the query sequence
#my $query_start = $a->query->start;
#my $query_end = $a->query->end;
my $ref_dna = $a->dna; # reference sequence bases
my $query_dna = $a->query->dna; # query sequence bases
my @scores = $a->qscore; # per-base quality scores
my $match_qual= $a->qual; # quality of the match
#store reference and query into separate variable for manipulation later
my $reference = $ref_dna;
my $query = $query_dna;
#from the CIGAR string, fill in the insertions and deletions
my $position = '0';
while ($cigar !~ /^$/){
if ($cigar =~ /^([0-9]+[MIDS])/){
my $cigar_part = $1;
if ($cigar_part =~ /(\d+)M/){
$position += $1;
} elsif ($cigar_part =~ /(\d+)I/){
my $insertion = '-' x $1;
substr($reference,$position,0,$insertion);
$position += $1;
} elsif ($cigar_part =~ /(\d+)D/){
my $insertion = '-' x $1;
substr($query,$position,0,$insertion);
$position += $1;
} elsif ($cigar_part =~ /(\d+)S/){
die "Not ready for this!\n";
#my $insertion = 'x' x $1;
#substr($new_ref,$position,0,$insertion);
#$position += $1;
}
$cigar =~ s/$cigar_part//;
} else {
die "Unexpected cigar: $id $cigar\n";
}
}
#in the bam files I process, the original tags are reverse complemented
#re-reverse complement to see the original tag
if ($strand eq '-'){
$query = rev_com($query);
$reference = rev_com($reference);
}
my $ed = '0';
for (my $i =0; $i < length($reference); ++$i){
my $q = substr($query,$i,1);
my $r = substr($reference,$i,1);
if ($q ne $r){
++$ed;
#annotated as a deletion into the reference sequence
if ($q eq '-'){
if (exists $error_profile{$i}{'deletion'}){
$error_profile{$i}{'deletion'}++;
} else {
$error_profile{$i}{'deletion'} = '1';
}
}
#annotated as an insertion into the reference sequence
elsif ($r eq '-'){
if (exists $error_profile{$i}{'insertion'}){
$error_profile{$i}{'insertion'}++;
} else {
$error_profile{$i}{'insertion'} = '1';
}
}
#mismatch
else {
if ($q eq 'A'){
if (exists $error_profile{$i}{$q}){
$error_profile{$i}{$q}++;
} else {
$error_profile{$i}{$q} = '1';
}
}
elsif ($q eq 'C'){
if (exists $error_profile{$i}{$q}){
$error_profile{$i}{$q}++;
} else {
$error_profile{$i}{$q} = '1';
}
}
elsif ($q eq 'G'){
if (exists $error_profile{$i}{$q}){
$error_profile{$i}{$q}++;
} else {
$error_profile{$i}{$q} = '1';
}
}
elsif ($q eq 'T'){
if (exists $error_profile{$i}{$q}){
$error_profile{$i}{$q}++;
} else {
$error_profile{$i}{$q} = '1';
}
}
elsif ($q eq 'N'){
if (exists $error_profile{$i}{$q}){
$error_profile{$i}{$q}++;
} else {
$error_profile{$i}{$q} = '1';
}
}
else {
die "That was unexpected q = $q que = $query ref = $reference\n";
}
}
}
}
#double check
die "Error in calculating edit distance nm: $nm_tag ed: $ed\n" if $ed ne $nm_tag;
#print "$id $strand $match_qual $nm_tag $md_tag\n";
#print "Que: $query\nRef: $reference\n\n";
}
}
foreach my $base (sort {$a <=> $b} keys %error_profile){
print "$base\n";
foreach my $type (keys %{$error_profile{$base}}){
print "\t$type\t$error_profile{$base}{$type}\n";
}
}
print "Total tag = $total_tag\n";
exit(0);
sub rev_com {
my ($seq) = @_;
$seq = reverse($seq);
$seq =~ tr/ACGT/TGCA/;
return($seq);
}
__END__
There are definitely much better ways of doing this. Actually, there is a tool called SAMStat that does this already (and more) if you open up the html file with the results, so use that instead.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Thanks, this saves me some time! I tried samstat, but I want to make my own figures…
No problems! Lincoln did all the hard work 🙂
I’m currently trying to understand the CIGAR and MD tag information for a Python program I’m writing, and as a non-bioinformatician the examples have been invaluable to me – thanks for putting these up!
One comment – I think that the recovered reference sequence in the last example (with both insertions and deletions) might be incorrect? The MD tag suggests that the final ‘A’ should be mutated to ‘T’. However I may have missed something.
Thanks again for the examples!
Hi Peter,
No worries, I am also a biologist by training 🙂
I have now updated the posts with the alignments created from the CIGAR string and MD tags.
It seems you are right about the last example. Thanks for letting me know!
Cheers,
Dave
Hey Peter
Just wondering if you wrote a parser for CIGAR and MD tag in python ? I am looking to do the same and would happily use yours 🙂
-Abhi
very useful; thanks
Thanks. Glad you found it useful.
Can you tell me how to understand the result with the SamStat? I got the output but can not kown about the mismatch.
Perhaps this page is useful http://davetang.org/wiki/tiki-index.php?page=SAMStat
Thank you very much !
Hi i find it very useful.
Just wondering if you have a more up to date version of that script ??
thanks,
david
No, sorry I don’t.