Perl and SAM

Lincoln Stein has written a bunch of modules for dealing with SAM/BAM files; check out Bio::DB::Sam on CPAN. If you are having trouble installing Bio::DB::Sam, you may have to recompile SAMtools with the following command:

make CXXFLAGS=-fPIC CFLAGS=-fPIC CPPFLAGS=-fPIC

To install the Perl module on a machine where you don't have root access, follow these instructions.

Using this module, I recreated each alignment by parsing the CIGAR string and then walked along the alignment to collect edit statistics. The CIGAR string indicates the number of matches/mismatches, insertions and deletions in each alignment. The MD tag gives better resolution of the actual nucleotides involved, but since the module generates the reference sequence for you, parsing the MD tag isn't necessary.

Some examples of how the CIGAR string and the MD tag annotate alignments:

No insertions or deletions:
Positive strand match CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG CIGAR: 36M MD:Z:1A0C0C0C1T0C0T27
Reference genome = CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG

CGATACGGGGACATCCGGCCTGCTCCTTCTCACATG 
 ---- ---
CACCCCTCTGACATCCGGCCTGCTCCTTCTCACATG

Insertion:
Positive strand match GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT CIGAR: 6M1I29M MD:Z:0C1C0C1C0T0C27
Reference genome = CACCCCTCTGACATCCGGCCTGCTCCTTCTCACAT

GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT
- -- - --
CACCCC-TCTGACATCCGGCCTGCTCCTTCTCACAT

Deletion:
Positive strand match AGTGATGGGGGGGTTCCAGGTGGAGACGAGGACTCC CIGAR: 9M9D27M MD:Z:2G0A5^ATGATGTCA27
Reference genome = AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC

AGTGATGGG---------GGGGTTCCAGGTGGAGACGAGGACTCC
  --     ---------
AGGAATGGGATGATGTCAGGGGTTCCAGGTGGAGACGAGGACTCC

And all three together:
Positive strand match AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA CIGAR: 2M1I7M6D26M MD:Z:3C3T1^GCTCAG26
Reference genome = AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA

AGTGATGGGA------GGATGTCTCGTCTGTGAGTTACAGCA
    -   -
AG-GCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA
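
The padding shown in these examples can be sketched in Python: walk the CIGAR operations and insert gap characters into the query or the reference. This is a minimal sketch (the `pad_alignment` name is mine; soft clips are deliberately not handled, just as in my Perl code below, and only M/I/D/S operations are recognised):

```python
import re

def pad_alignment(query, ref, cigar):
    """Insert '-' gaps into the query and reference so they can be
    compared column by column, based on the CIGAR string."""
    q, r = query, ref
    pos = 0
    for length, op in re.findall(r"(\d+)([MIDS])", cigar):
        n = int(length)
        if op == "M":    # match/mismatch: consume both sequences
            pos += n
        elif op == "I":  # insertion: bases in the query, gaps in the reference
            r = r[:pos] + "-" * n + r[pos:]
            pos += n
        elif op == "D":  # deletion: bases in the reference, gaps in the query
            q = q[:pos] + "-" * n + q[pos:]
            pos += n
        else:            # soft clipping is not handled in this sketch
            raise NotImplementedError(op)
    return q, r

# the insertion example from above: 6M1I29M
q, r = pad_alignment(
    "GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT",
    "CACCCCTCTGACATCCGGCCTGCTCCTTCTCACAT",
    "6M1I29M",
)
print(q)  # GAGACGGGGTGACATCCGGCCTGCTCCTTCTCACAT
print(r)  # CACCCC-TCTGACATCCGGCCTGCTCCTTCTCACAT
```

The same function reproduces the deletion example (9M9D27M) by padding the query instead of the reference.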

And below is my ugly code to output the frequency of mismatches/deletions/insertions at each base position of a tag alignment.

#!/usr/bin/perl

use strict;
use warnings;

use Bio::DB::Sam;
#use Data::Dumper;

my $usage = "Usage $0 <infile.bam>\n";
my $infile = shift or die $usage;

my $sam = Bio::DB::Sam->new(-bam  => $infile);

my %error_profile = ();

#store all the targets which should be all assembled chromosomes
my @targets = $sam->seq_ids;

@targets = 'chrM'; #comment out this line if you are game to test it on all alignments
my $total_tag = '0';

#iterate through each target individually, since each target is stored in memory
foreach my $target (@targets){
   warn "$target\n";
   my @alignments = $sam->get_features_by_location(-seq_id => $target);

   for my $a (@alignments) {
      ++$total_tag;

      #tag id
      my $id  = $a->name;

      #alignment information in the reference sequence
      #my $seqid  = $a->seq_id;
      #my $start  = $a->start;
      #my $end    = $a->end;
      #strand is stored as -1 or 1
      my $strand = $a->strand;
      $strand = '+' if $strand =~ /^1$/;
      $strand = '-' if $strand =~ /^-1$/;
      my $cigar  = $a->cigar_str;
      my $md_tag = $a->get_tag_values('MD');
      my $nm_tag = $a->get_tag_values('NM');

      die "$id MD tag not defined\n" if !defined($md_tag) or $md_tag eq '';
      die "$id NM tag not defined\n" if !defined($nm_tag) or $nm_tag eq '';

      #alignment information in the query sequence
      #my $query_start = $a->query->start;
      #my $query_end   = $a->query->end;
      my $ref_dna   = $a->dna;        # reference sequence bases
      my $query_dna = $a->query->dna; # query sequence bases
      my @scores    = $a->qscore;     # per-base quality scores
      my $match_qual= $a->qual;       # quality of the match

      #store reference and query into separate variable for manipulation later
      my $reference = $ref_dna;
      my $query = $query_dna;

      #from the CIGAR string, fill in the insertions and deletions
      my $position = '0';
      while ($cigar !~ /^$/){
         if ($cigar =~ /^([0-9]+[MIDS])/){
            my $cigar_part = $1;
            if ($cigar_part =~ /(\d+)M/){
               $position += $1;
            } elsif ($cigar_part =~ /(\d+)I/){
               my $insertion = '-' x $1;
               substr($reference,$position,0,$insertion);
               $position += $1;
            } elsif ($cigar_part =~ /(\d+)D/){
               my $insertion = '-' x $1;
               substr($query,$position,0,$insertion);
               $position += $1;
            } elsif ($cigar_part =~ /(\d+)S/){
               die "Not ready for this!\n";
               #my $insertion = 'x' x $1;
               #substr($new_ref,$position,0,$insertion);
               #$position += $1;
            }
            #remove the processed part from the start of the CIGAR string
            $cigar =~ s/^\Q$cigar_part\E//;
         } else {
            die "Unexpected cigar: $id $cigar\n";
         }
      }

      #in the bam files I process, the original tags are reverse complemented
      #re-reverse complement to see the original tag
      if ($strand eq '-'){
         $query = rev_com($query);
         $reference = rev_com($reference);
      }

      my $ed = '0';
      for (my $i =0; $i < length($reference); ++$i){
         my $q = substr($query,$i,1);
         my $r = substr($reference,$i,1);
         if ($q ne $r){
            ++$ed;
            #rely on autovivification: ++ on an undefined hash value yields 1
            #gap in the query: a deletion relative to the reference
            if ($q eq '-'){
               $error_profile{$i}{'deletion'}++;
            }
            #gap in the reference: an insertion relative to the reference
            elsif ($r eq '-'){
               $error_profile{$i}{'insertion'}++;
            }
            #mismatch: tally the query base
            elsif ($q =~ /^[ACGTN]$/){
               $error_profile{$i}{$q}++;
            }
            else {
               die "That was unexpected q = $q que = $query ref = $reference\n";
            }
         }
      }
      #double check
      die "Error in calculating edit distance nm: $nm_tag ed: $ed\n" if $ed ne $nm_tag;

      #print "$id $strand $match_qual $nm_tag $md_tag\n";
      #print "Que: $query\nRef: $reference\n\n";

   }
}

foreach my $base (sort {$a <=> $b} keys %error_profile){
   print "$base\n";
   foreach my $type (keys %{$error_profile{$base}}){
      print "\t$type\t$error_profile{$base}{$type}\n";
   }
}

print "Total tag = $total_tag\n";

exit(0);

sub rev_com {
   my ($seq) = @_;
   $seq = reverse($seq);
   $seq =~ tr/ACGT/TGCA/;
   return($seq);
}

__END__

There are definitely much better ways of doing this. In fact, there is a tool called SAMStat that already does this (and more); open the HTML results file it produces. Use that instead.

Normalisation methods implemented in edgeR

Updated 2014 December 12th

A short post on the different normalisation methods implemented within edgeR; to see the normalisation methods type:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
library(edgeR)
?calcNormFactors

From the documentation:

method="TMM" is the weighted trimmed mean of M-values (to the reference) proposed by Robinson and Oshlack (2010), where the weights are from the delta method on Binomial data. If refColumn is unspecified, the library whose upper quartile is closest to the mean upper quartile is used.

method="RLE" is the scaling factor method proposed by Anders and Huber (2010). We call it "relative log expression", as a median library is calculated from the geometric mean of all columns and the median ratio of each sample to the median library is taken as the scale factor.

method="upperquartile" is the upper-quartile normalization method of Bullard et al (2010), in which the scale factors are calculated from the 75% quantile of the counts for each library, after removing transcripts which are zero in all libraries. This idea is generalized here to allow scaling by any quantile of the distributions.

If method="none", then the normalization factors are set to 1.

For symmetry, normalization factors are adjusted to multiply to 1. The effective library size is then the original library size multiplied by the scaling factor.

Note that rows that have zero counts for all columns are trimmed before normalization factors are computed. Therefore rows with all zero counts do not affect the estimated factors.
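
The RLE idea can be sketched outside of R. Below is an illustrative Python sketch, not edgeR's actual code (edgeR also folds in library sizes, which are equal in the toy dataset used below, so the numbers come out the same): build a reference "median library" from the per-row geometric means, skip rows containing a zero (their geometric mean is zero), take each sample's median ratio to the reference, and rescale so the factors multiply to 1.

```python
import math

def rle_factors(counts):
    """counts: list of per-sample count vectors (one list per sample).
    Returns one RLE-style scaling factor per sample (product of factors = 1).
    Assumes each sample has a positive median ratio."""
    nsamples = len(counts)
    ngenes = len(counts[0])
    factors = []
    for s in range(nsamples):
        ratios = []
        for g in range(ngenes):
            row = [counts[k][g] for k in range(nsamples)]
            if all(v > 0 for v in row):  # skip rows whose geometric mean is zero
                ref = math.exp(sum(math.log(v) for v in row) / nsamples)
                ratios.append(counts[s][g] / ref)
        ratios.sort()
        factors.append(ratios[len(ratios) // 2])  # median ratio to the reference
    # rescale so the factors multiply to 1
    gm = math.exp(sum(math.log(f) for f in factors) / nsamples)
    return [f / gm for f in factors]

# toy data: 25 shared transcripts plus 25 control-only transcripts
control = [10] * 50
patient = [20] * 25 + [0] * 25
print(rle_factors([control, control, patient, patient]))
# controls get ~0.7071 (1/sqrt(2)), patients ~1.4142 (sqrt(2))
```

These are the same 0.7071068 and 1.4142136 factors that edgeR reports for this dataset below.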

A test dataset

I created a dataset to test the different normalisation methods. There are four samples: columns one and two are the controls and columns three and four are the patients. 25 transcripts are present in all four samples in equal amounts. Another 25 transcripts are present only in the controls, at the same expression level as the first 25 transcripts. All four samples have exactly the same depth (500 counts). However, since the patient samples contain half as many distinct transcripts as the controls (25 versus 50), their transcripts are sequenced at twice the depth. This hypothetical situation is described in Robinson and Oshlack.

#prepare example
control_1 <- rep(10, 50)
control_2 <- rep(10, 50)
patient_1 <- c(rep(20, 25),rep(0,25))
patient_2 <- c(rep(20, 25),rep(0,25))

df <- data.frame(c1=control_1,
                 c2=control_2,
                 p1=patient_1,
                 p2=patient_2)

head(df)
  c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
6 10 10 20 20

tail(df)
   c1 c2 p1 p2
45 10 10  0  0
46 10 10  0  0
47 10 10  0  0
48 10 10  0  0
49 10 10  0  0
50 10 10  0  0

#equal depth
colSums(df)
 c1  c2  p1  p2 
500 500 500 500

No normalisation

Let's run the differential expression analysis without any normalisation step:

#load library
library(edgeR)

#create group vector
group <- c('control','control','patient','patient')

#create DGEList object
d <- DGEList(counts=df, group=group)

#check out the DGEList object
d
An object of class "DGEList"
$counts
  c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...

$samples
     group lib.size norm.factors
c1 control      500            1
c2 control      500            1
p1 patient      500            1
p2 patient      500            1

d <- estimateCommonDisp(d)

#perform the DE test
de <- exactTest(d)

#how many differentially expressed transcripts?
table(p.adjust(de$table$PValue, method="BH")<0.05)

TRUE 
  50

Without normalisation, every transcript is differentially expressed.
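
The p.adjust(..., method="BH") call used throughout is the Benjamini-Hochberg step-up procedure. A minimal Python sketch of the same adjustment (`bh_adjust` is my own hypothetical helper, not an R or edgeR function):

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, as in R's p.adjust(method='BH')."""
    n = len(pvalues)
    # visit p-values from largest to smallest, remembering original positions
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0  # also caps adjusted p-values at 1
    for rank, i in zip(range(n, 0, -1), order):
        # adjusted p = min over all larger p-values of p * n / rank
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

print([round(p, 6) for p in bh_adjust([0.01, 0.02, 0.03, 0.20])])
# → [0.04, 0.04, 0.04, 0.2]
```

Transcripts whose adjusted p-value falls below 0.05 are the ones counted as differentially expressed above.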

TMM normalisation

Let's test the weighted trimmed mean of M-values method:

TMM <- calcNormFactors(d, method="TMM")
TMM
An object of class "DGEList"
$counts
  c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...

$samples
     group lib.size norm.factors
c1 control      500    0.7071068
c2 control      500    0.7071068
p1 patient      500    1.4142136
p2 patient      500    1.4142136

If we use these scaling factors, for transcript 1 we would compare 10/0.7071068 (~14.14) with 20/1.4142136 (~14.14) and correctly conclude that it is not differentially expressed. Let's run the differential expression test:

TMM <- estimateCommonDisp(TMM)
TMM <- exactTest(TMM)
table(p.adjust(TMM$table$PValue, method="BH")<0.05)

FALSE  TRUE 
   25    25

Only half of the transcripts are differentially expressed: the 25 transcripts that are present only in the control samples.

RLE normalisation

Let's test the relative log expression normalisation method:

RLE <- calcNormFactors(d, method="RLE")
RLE
An object of class "DGEList"
$counts
  c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...

$samples
     group lib.size norm.factors
c1 control      500    0.7071068
c2 control      500    0.7071068
p1 patient      500    1.4142136
p2 patient      500    1.4142136
RLE <- estimateCommonDisp(RLE)
RLE <- exactTest(RLE)
table(p.adjust(RLE$table$PValue, method="BH")<0.05)

FALSE  TRUE 
   25    25

Upper-quartile normalisation

Lastly let's try the upper-quartile normalisation method:

uq <- calcNormFactors(d, method="upperquartile")
uq
An object of class "DGEList"
$counts
  c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...

$samples
     group lib.size norm.factors
c1 control      500    0.7071068
c2 control      500    0.7071068
p1 patient      500    1.4142136
p2 patient      500    1.4142136

uq <- estimateCommonDisp(uq)
uq <- exactTest(uq)
table(p.adjust(uq$table$PValue, method="BH")<0.05)

FALSE  TRUE 
   25    25
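
The upper-quartile factors above can be reproduced by hand: take the 75% quantile of each library's counts (this toy dataset has no transcripts that are zero in every library, so nothing is removed), divide by the library size, and rescale so the factors multiply to 1. A sketch of this arithmetic, assuming R's default (type 7) quantile; the helper names are mine:

```python
import math

def quantile_type7(values, p):
    """R's default quantile (type 7): linear interpolation of order statistics."""
    xs = sorted(values)
    h = (len(xs) - 1) * p
    lo = int(math.floor(h))
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

def upper_quartile_factors(samples, p=0.75):
    """samples: list of count vectors (rows zero in every library already removed).
    Returns per-sample factors rescaled to multiply to 1."""
    raw = [quantile_type7(s, p) / sum(s) for s in samples]
    gm = math.exp(sum(math.log(f) for f in raw) / len(raw))
    return [f / gm for f in raw]

control = [10] * 50
patient = [20] * 25 + [0] * 25
print(upper_quartile_factors([control, control, patient, patient]))
# controls get ~0.7071, patients ~1.4142, matching the edgeR output above
```

The 75% quantiles are 10, 10, 20 and 20; dividing by the common depth of 500 and rescaling to a geometric mean of 1 gives exactly the 0.7071068 and 1.4142136 reported by calcNormFactors.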

Testing a real dataset

require(RCurl)
my_file <- getURL("https://dl.dropboxusercontent.com/u/15251811/pnas_expression.txt")
data <- read.table(textConnection(my_file), header=T, sep="\t")

d <- data[,2:8]
rownames(d) <- data[,1]
group <- c(rep("Control",4),rep("DHT",3))
d <- DGEList(counts = d, group=group)

#no normalisation
no_norm <- estimateCommonDisp(d)
no_norm <- exactTest(no_norm)
table(p.adjust(no_norm$table$PValue, method="BH")<0.05)

FALSE  TRUE 
33404  4031

#TMM
TMM <- calcNormFactors(d, method="TMM")
TMM
An object of class "DGEList"
$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696     0     0     0     0     0     0     0
ENSG00000215700     0     0     0     0     0     0     0
ENSG00000215699     0     0     0     0     0     0     0
ENSG00000215784     0     0     0     0     0     0     0
ENSG00000212914     0     0     0     0     0     0     0
37430 more rows ...

$samples
        group lib.size norm.factors
lane1 Control   978576    1.0350786
lane2 Control  1156844    1.0379515
lane3 Control  1442169    1.0287815
lane4 Control  1485604    1.0222095
lane5     DHT  1823460    0.9446243
lane6     DHT  1834335    0.9412769
lane8     DHT   681743    0.9954283

TMM <- estimateCommonDisp(TMM)
TMM <- exactTest(TMM)
table(p.adjust(TMM$table$PValue, method="BH")<0.05)

FALSE  TRUE 
33519  3916

#RLE
RLE <- calcNormFactors(d, method="RLE")
RLE
An object of class "DGEList"
$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696     0     0     0     0     0     0     0
ENSG00000215700     0     0     0     0     0     0     0
ENSG00000215699     0     0     0     0     0     0     0
ENSG00000215784     0     0     0     0     0     0     0
ENSG00000212914     0     0     0     0     0     0     0
37430 more rows ...

$samples
        group lib.size norm.factors
lane1 Control   978576    1.0150010
lane2 Control  1156844    1.0236675
lane3 Control  1442169    1.0345426
lane4 Control  1485604    1.0399724
lane5     DHT  1823460    0.9706692
lane6     DHT  1834335    0.9734955
lane8     DHT   681743    0.9466713

RLE <- estimateCommonDisp(RLE)
RLE <- exactTest(RLE)
table(p.adjust(RLE$table$PValue, method="BH")<0.05)

FALSE  TRUE 
33465  3970

#upper-quartile
uq <- calcNormFactors(d, method="upperquartile")
uq
An object of class "DGEList"
$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696     0     0     0     0     0     0     0
ENSG00000215700     0     0     0     0     0     0     0
ENSG00000215699     0     0     0     0     0     0     0
ENSG00000215784     0     0     0     0     0     0     0
ENSG00000212914     0     0     0     0     0     0     0
37430 more rows ...

$samples
        group lib.size norm.factors
lane1 Control   978576    1.0272514
lane2 Control  1156844    1.0222982
lane3 Control  1442169    1.0250528
lane4 Control  1485604    1.0348864
lane5     DHT  1823460    0.9728534
lane6     DHT  1834335    0.9670858
lane8     DHT   681743    0.9541011

uq <- estimateCommonDisp(uq)
uq <- exactTest(uq)
table(p.adjust(uq$table$PValue, method="BH")<0.05)

FALSE  TRUE 
33466  3969

DESeq vs. edgeR vs. baySeq

6th April 2012: for a more up-to-date version of this post, please see this post.

A very simple comparison

I used the TagSeqExample.tab file from the DESeq package as the benchmark dataset. According to the DESeq authors, T1a and T1b are similar, so I removed the second column of the file, which corresponds to T1a:

cut --complement -f2 TagSeqExample.tab > tag_seq_example_less_t1a.tsv

Hierarchical clustering

To get an idea of how similar the libraries are, I performed hierarchical clustering using the Spearman correlation of the libraries (see here). Note that the raw reads have not been adjusted for sequencing depth, but by using a rank-based correlation we can hopefully bypass this necessity.

From the dendrogram, we observe the marked difference of the normal samples to the tumour samples.
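
Why does a rank-based correlation let us skip the depth adjustment? Multiplying every count in a library by a constant leaves the ranks, and therefore the Spearman correlation, unchanged. A quick pure-Python check with made-up counts (all helper names are mine):

```python
def ranks(xs):
    """Average ranks, 1-based; ties share the mean rank of their run."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

lib1 = [5, 0, 33, 12, 7, 101]  # hypothetical raw counts
lib2 = [9, 1, 30, 50, 11, 280]
print(spearman(lib1, lib2))
print(spearman(lib1, [3 * x for x in lib2]))  # same value: scaling preserves ranks
```

Tripling every count in lib2 (as if it were sequenced three times deeper) gives exactly the same Spearman correlation.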

Installing the R packages

I then installed all the required packages (R version 2.12.0, biocinstall version 2.7.4):

source("http://www.bioconductor.org/biocLite.R")
biocLite("baySeq")
biocLite("DESeq")
biocLite("edgeR")

DESeq

DESeq code adapted from the vignette:

countsTable <- read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, stringsAsFactors=TRUE)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[,-1]
head(countsTable)
conds = c("T","T","T","N","N")
cds<-newCountDataSet(countsTable,conds)
cds<-estimateSizeFactors(cds)
sizeFactors(cds)
      T1b        T2        T3        N1        N2 
0.5587394 1.5823096 1.1270425 1.2869337 0.8746998 
cds <- estimateVarianceFunctions(cds)
res <- nbinomTest(cds, "T", "N")
resSig <- res[ res$padj < .1, ]
nrow(resSig)
[1] 1072
write.csv(resSig,file="deseq_res_sig.csv")

edgeR

First transform the tag_seq_example_less_t1a.tsv file to the edgeR format using this script.

tsv_to_edger.pl tag_seq_example_less_t1a.tsv

library(edgeR)
targets = read.delim(file = "targets.txt", stringsAsFactors = FALSE)
d = readDGE(targets,comment.char="#")
d$samples
  files   group description lib.size norm.factors
1   T1b patient     patient  2399311            1
2    T2 patient     patient  7202770            1
3    T3 patient     patient  5856461            1
4    N1 control     control  6376307            1
5    N2 control     control  3931277            1

Recently a normalisation function called calcNormFactors() was added to edgeR. For more information see this paper.

d = calcNormFactors(d)
d$samples
  files   group description lib.size norm.factors
1   T1b patient     patient  2399311    1.1167157
2    T2 patient     patient  7202770    0.9978323
3    T3 patient     patient  5856461    0.9170402
4    N1 control     control  6376307    0.9546010
5    N2 control     control  3931277    1.0251550

This approach is different from the one taken by DESeq.

d = estimateCommonDisp(d)
de.com = exactTest(d)
options(digits=4)
topTags(de.com)
detags.com = rownames(topTags(de.com)$table)

sum(de.com$table$p.value < 0.01)
[1] 895

sum(p.adjust(de.com$table$p.value,method="BH") < 0.1)
[1] 593

good = sum(p.adjust(de.com$table$p.value,method="BH") < 0.1)
goodList = topTags(de.com, n=good)
sink("edger.txt")
goodList
sink()

baySeq

library(baySeq)
all = read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, sep="\t")
lib <- c(2399311,7202770,5856461,6376307,3931277)
replicates <- c(1,1,1,2,2)
groups <- list(NDE = c(1,1,1,1,1), DE = c(1,1,1,2,2) )
cname <- all[,1]
all <- all[,-1]
all = as.matrix(all)
CD <- new("countData", data = all, replicates = replicates, libsizes = as.integer(lib), groups = groups)
CD@annotation <- as.data.frame(cname)
cl <- NULL
CDP.NBML <- getPriors.NB(CD, samplesize = 1000, estimation = "QL", cl = cl)
CDPost.NBML <- getLikelihoods.NB(CDP.NBML, pET = 'BIC', cl = cl)
CDPost.NBML@estProps
[1] 0.8897621 0.1102379
topCounts(CDPost.NBML, group=2)
NBML.TPs <- getTPs(CDPost.NBML, group=2, TPs = 1:100)
bayseq_de = topCounts(CDPost.NBML, group=2, number=2000)
write.csv(bayseq_de, file="bayseq_de.csv")

Overlap

R package   Number of differentially expressed genes
DESeq       888
edgeR       895
baySeq      1115
Total       18,760

How to make a Venn diagram

DESeq with edgeR: 591
DESeq with baySeq: 488
edgeR with baySeq: 465
DESeq with edgeR with baySeq: 338
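
These overlap counts are just set intersections of the three significant gene lists. With hypothetical gene IDs standing in for the real result files, the pairwise and three-way overlaps fall out of Python's set operators:

```python
# hypothetical toy gene sets standing in for the three results files
deseq  = {"geneA", "geneB", "geneC", "geneD"}
edger  = {"geneB", "geneC", "geneD", "geneE"}
bayseq = {"geneC", "geneD", "geneE", "geneF"}

print(len(deseq & edger))           # DESeq with edgeR -> 3
print(len(deseq & bayseq))          # DESeq with baySeq -> 2
print(len(edger & bayseq))          # edgeR with baySeq -> 3
print(len(deseq & edger & bayseq))  # all three -> 2
```

Feeding the real gene lists through the same intersections gives the counts above, which are what a Venn diagram tool needs as input.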

This is a first draft post and will be continually updated in the future, but I thought I'd publish it first.

DESeq

Code taken from the DESeq vignette for my own convenience.

library("DESeq")

exampleFile = system.file("extra/TagSeqExample.tab", package="DESeq")

countsTable = read.delim(exampleFile, header=TRUE, stringsAsFactors=TRUE)

rownames(countsTable) = countsTable$gene

countsTable = countsTable[ , -1]

conds = c("T","T","T","Tb","N","N")

cds = newCountDataSet(countsTable, conds)

cds = cds[,-1]

cds = estimateSizeFactors(cds)

sizeFactors(cds)

cds <- estimateVarianceFunctions(cds)

res <- nbinomTest(cds, "N", "T")

resSig <- res[ res$padj < .1, ]