Perl and SAM

Lincoln Stein has written a bunch of modules to deal with SAM/BAM files. Check out the CPAN module. If you are having trouble installing Bio::DB::Sam, you may have to recompile SAMTools with the following command:

make CXXFLAGS=-fPIC CFLAGS=-fPIC CPPFLAGS=-fPIC

To install the Perl module on a machine where you don't have root access, follow these instructions.

Using this module, I recreated the alignment by parsing the CIGAR string and then went through the alignment to collect the edit statistics. The CIGAR line indicates the number of Matches/Mismatches, Insertions and Deletions in each alignment. The MD tag gives a better resolution of the nucleotides involved but using the module the reference sequence is generated for you already.

Some examples of how the CIGAR string and the MD tag annotates alignments:

No insertions or deletions:
Positive strand match CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG CIGAR: 36M MD:Z:1A0C0C0C1T0C0T27
Reference genome = CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG

CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG 
 ---- ---
CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG

Insertion:
Positive strand match GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT CIGAR: 6M1I29M MD:Z:0C1C0C1C0T0C27
Reference genome = CACCCCTCTGACATCCGGCCTGCTCCTTCTCACAT

GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT
- -- - --
CACCCC-TCTGACATCCGGCCTGCTCCTTCTCACAT

Deletion:
Positive strand match AGTGATGGGGGGGTTCCAGGTGGAGACGAGGACTCC CIGAR: 9M9D27M MD:Z:2G0A5^ATGATGTCA27
Reference genome = AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC

AGTGATGGG---------GGGGTTCCAGGTGGAGACGAGGACTCC
  --     ---------
AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC

And all three together:
Positive strand match AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA CIGAR: 2M1I7M6D26M MD:Z:3C3T1^GCTCAG26
Reference genome = AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA

AGTGATGGGA------GGATGTCTCGTCTGTGAGTTACAGCA
    -   -
AG-GCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA

And below is my ugly code to output the frequency of mismatches/deletions/insertions of each base position in a tag alignment.

#!/usr/bin/perl

use strict;
use warnings;

use Bio::DB::Sam;
#use Data::Dumper;

my $usage = "Usage $0 <infile.bam>\n";
my $infile = shift or die $usage;

my $sam = Bio::DB::Sam->new(-bam  => $infile);

my %error_profile = ();

#store all the targets which should be all assembled chromosomes
my @targets = $sam->seq_ids;

@targets = 'chrM'; #comment out this line if you are game to test it on all alignments
my $total_tag = '0';

#iterate through each target individually, since each target is stored in memory
foreach my $target (@targets){
   warn "$target\n";
   my @alignments = $sam->get_features_by_location(-seq_id => $target);

   for my $a (@alignments) {
      ++$total_tag;

      #tag id
      my $id  = $a->name;

      #alignment information in the reference sequence
      #my $seqid  = $a->seq_id;
      #my $start  = $a->start;
      #my $end    = $a->end;
      #strand is stored as -1 or 1
      my $strand = $a->strand;
      $strand = '+' if $strand =~ /^1$/;
      $strand = '-' if $strand =~ /^-1$/;
      my $cigar  = $a->cigar_str;
      my $md_tag = $a->get_tag_values('MD');
      my $nm_tag = $a->get_tag_values('NM');

      die "$id MD tag not defined on line $." if ($md_tag =~ /^$/);
      die "$id NM tag not defined on line $." if ($nm_tag =~ /^$/);

      #alignment information in the query sequence
      #my $query_start = $a->query->start;
      #my $query_end   = $a->query->end;
      my $ref_dna   = $a->dna;        # reference sequence bases
      my $query_dna = $a->query->dna; # query sequence bases
      my @scores    = $a->qscore;     # per-base quality scores
      my $match_qual= $a->qual;       # quality of the match

      #store reference and query into separate variable for manipulation later
      my $reference = $ref_dna;
      my $query = $query_dna;

      #from the CIGAR string, fill in the insertions and deletions
      my $position = '0';
      while ($cigar !~ /^$/){
         if ($cigar =~ /^([0-9]+[MIDS])/){
            my $cigar_part = $1;
            if ($cigar_part =~ /(\d+)M/){
               $position += $1;
            } elsif ($cigar_part =~ /(\d+)I/){
               my $insertion = '-' x $1;
               substr($reference,$position,0,$insertion);
               $position += $1;
            } elsif ($cigar_part =~ /(\d+)D/){
               my $insertion = '-' x $1;
               substr($query,$position,0,$insertion);
               $position += $1;
            } elsif ($cigar_part =~ /(\d+)S/){
               die "Not ready for this!\n";
               #my $insertion = 'x' x $1;
               #substr($new_ref,$position,0,$insertion);
               #$position += $1;
            }
            $cigar =~ s/$cigar_part//;
         } else {
            die "Unexpected cigar: $id $cigar\n";
         }
      }

      #in the bam files I process, the original tags are reverse complemented
      #re-reverse complement to see the original tag
      if ($strand eq '-'){
         $query = rev_com($query);
         $reference = rev_com($reference);
      }

      my $ed = '0';
      for (my $i =0; $i < length($reference); ++$i){
         my $q = substr($query,$i,1);
         my $r = substr($reference,$i,1);
         if ($q ne $r){
            ++$ed;
            #annotated as a deletion into the reference sequence
            if ($q eq '-'){
               if (exists $error_profile{$i}{'deletion'}){
                  $error_profile{$i}{'deletion'}++;
               } else {
                  $error_profile{$i}{'deletion'} = '1';
               }
            }
            #annotated as an insertion into the reference sequence
            elsif ($r eq '-'){
               if (exists $error_profile{$i}{'insertion'}){
                  $error_profile{$i}{'insertion'}++;
               } else {
                  $error_profile{$i}{'insertion'} = '1';
               }
            }
            #mismatch
            else {
               if ($q eq 'A'){
                  if (exists $error_profile{$i}{$q}){
                     $error_profile{$i}{$q}++;
                  } else {
                     $error_profile{$i}{$q} = '1';
                  }
               }
               elsif ($q eq 'C'){
                  if (exists $error_profile{$i}{$q}){
                     $error_profile{$i}{$q}++;
                  } else {
                     $error_profile{$i}{$q} = '1';
                  }
               }
               elsif ($q eq 'G'){
                  if (exists $error_profile{$i}{$q}){
                     $error_profile{$i}{$q}++;
                  } else {
                     $error_profile{$i}{$q} = '1';
                  }
               }
               elsif ($q eq 'T'){
                  if (exists $error_profile{$i}{$q}){
                     $error_profile{$i}{$q}++;
                  } else {
                     $error_profile{$i}{$q} = '1';
                  }
               }
               elsif ($q eq 'N'){
                  if (exists $error_profile{$i}{$q}){
                     $error_profile{$i}{$q}++;
                  } else {
                     $error_profile{$i}{$q} = '1';
                  }
               }
               else {
                  die "That was unexpected q = $q que = $query ref = $reference\n";
               }
            }
         }
      }
      #double check
      die "Error in calculating edit distance nm: $nm_tag ed: $ed\n" if $ed ne $nm_tag;

      #print "$id $strand $match_qual $nm_tag $md_tag\n";
      #print "Que: $query\nRef: $reference\n\n";

   }
}

foreach my $base (sort {$a <=> $b} keys %error_profile){
   print "$base\n";
   foreach my $type (keys %{$error_profile{$base}}){
      print "\t$type\t$error_profile{$base}{$type}\n";
   }
}

print "Total tag = $total_tag\n";

exit(0);

sub rev_com {
   my ($seq) = @_;
   $seq = reverse($seq);
   $seq =~ tr/ACGT/TGCA/;
   return($seq);
}

__END__

There are definitely much better ways of doing this. Actually, there is a tool called SAMStat that does this already (and more) if you open up the html file with the results, so use that instead.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

13 thoughts on “Perl and SAM

  1. I'm currently trying to understand the CIGAR and MD tag information for a Python program I'm writing, and as a non-bioinformatician the examples have been invaluable to me - thanks for putting these up!

    One comment - I think that the recovered reference sequence in the last example (with both insertions and deletions) might be incorrect? The MD tag suggests that the final 'A' should be mutated to 'T'. However I may have missed something.

    Thanks again for the examples!

    • Hi Peter,

      No worries, I am also a biologist by training 🙂

      I have now updated the posts with the alignments created from the CIGAR string and MD tags.

      It seems you are right about the last example. Thanks for letting me know!

      Cheers,

      Dave

    • Hey Peter

      Just wondering if you wrote a parser for CIGAR and MD tag in python ? I am looking to do the same and would happily use yours 🙂

      -Abhi

  2. Pingback: New cool stuff in Sambamba! « notes on IT & science

  3. Can you tell me how to understand the result with the SamStat? I got the output but can not kown about the mismatch.

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.