DESeq vs. edgeR vs. baySeq using pnas_expression.txt

Following the instructions from a previous post, I filtered the pnas_expression.txt dataset and saved the results in "pnas_expression_filtered.tsv" and then performed the differential gene expression analyses using the respective packages.

To run the Perl scripts below, just save the code into a file and name it "something.pl". Then make the file executable by running "chmod 755 something.pl". To execute the file in your current directory "./something.pl".

The input file (pnas_expression_filtered.tsv) looks like this:

ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000124208 478 619 628 744 483 716 240
ENSG00000182463 27 20 27 26 48 55 24
ENSG00000124201 180 218 293 275 373 301 88

And finally please refer to the respective Bioconductor packages for more information.

sessionInfo()

sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] snow_0.3-8         baySeq_1.10.0      DESeq_1.8.0        locfit_1.5-7      
[5] Biobase_2.16.0     BiocGenerics_0.2.0 edgeR_2.6.0        limma_3.12.0      

loaded via a namespace (and not attached):
 [1] annotate_1.34.0      AnnotationDbi_1.18.0 DBI_0.2-5           
 [4] genefilter_1.38.0    geneplotter_1.34.0   grid_2.15.0         
 [7] IRanges_1.14.2       lattice_0.20-6       RColorBrewer_1.0-5  
[10] RSQLite_0.11.1       splines_2.15.0       stats4_2.15.0       
[13] survival_2.36-12     xtable_1.7-0

DESeq

library("DESeq")
countsTable <- read.delim("pnas_expression_filtered.tsv", header=TRUE, row.names=1)
conds <- c(rep("C",4),rep("T",3))
cds <- newCountDataSet (countsTable, conds)
cds <- estimateSizeFactors(cds)
sizeFactors(cds)
cds <- estimateDispersions(cds)
str(fitInfo(cds))

plotDispEsts <- function(cds){
   plot(rowMeans(counts(cds, normalized=TRUE)),
   fitInfo(cds)$perGeneDispEsts,
   pch = '.', log="xy")
   xg <- 10^seq(-.5,5,length.out=300)
   lines(xg,fitInfo(cds)$dispFun(xg),col="red")
}

plotDispEsts(cds)

res <- nbinomTest(cds,"C","T")

head(res)

plotDE <- function(res)
   plot(res$baseMean,res$log2FoldChange,log="x",pch=20,cex=.3,col=ifelse(res$padj < .1, "red","black"))

plotDE(res)

hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")

resSig <- res[res$padj < 0.1,]

head( resSig[order(resSig$pval),])

write.table (res, file="results_deseq.txt")

For parsing the results_deseq.txt, I wrote this script, which outputs results with a adjusted p-value of 0.05 or less and sorted by the adjusted p-value:

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <deseq.results>\n";
my $infile = shift or die $usage;

my %result = ();
my $counter = '0';

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;

   #"id" "baseMean" "baseMeanA" "baseMeanB" "foldChange" "log2FoldChange" "pval" "padj"
   #"1" "ENSG00000124208" 529.399951591274 598.268222888992 437.575589860984 0.731403696736499 -0.451260176370596 0.00157794968191143 0.0126404575296003

   if (/^\"id/){
      next;
   }
   my ($junk,$id,$base_mean,$base_mean_a,$base_mean_b,$fold_change,$log_2_fold_change,$pval,$padj) = split(/\s+/);
   next unless $padj <= 0.05;
   ++$counter;
   $result{$counter}->{'all'} = $_;
   $result{$counter}->{'padj'} = $padj;
}
close(IN);

foreach my $num (sort {$result{$a}->{'padj'} <=> $result{$b}->{'padj'}} keys %result){
   print $result{$num}->{'all'},"\n";
}

exit(0);

__END__

edgeR

library(edgeR)
data <- read.table("pnas_expression_filtered.tsv", header=TRUE, row.names=1)

d <- data[,1:7]
rownames(d) <- row.names(data)

group <- c(rep("Control",4),rep("Test",3))
d <- DGEList(counts = d, group=group)
dim(d)
d <- calcNormFactors(d)
d
d <- estimateCommonDisp(d)
d$common.dispersion
sqrt(d$common.dispersion)
de.com <- exactTest(d)
summary(decideTestsDGE(de.com,p.value=0.01))
summary(decideTestsDGE(de.com,p.value=0.05))
com = summary(decideTestsDGE(de.com,p.value=0.05))
com_total = com[1] + com[3]

sink("de_common_dispersion.txt")
topTags(de.com,n=com_total)
sink()

getPriorN(d)
d <- estimateTagwiseDisp(d, prop.used=0.5, grid.length=500)
png(file="tgw_disp_vs_log_conc.png", height=1200, width=1200)
plot(log2(d$conc$conc.common),d$tagwise.dispersion,panel.first=grid())
abline(h=d$common.dispersion,col="dodgerblue",lwd=3)
dev.off()
de.tgw <- exactTest(d)
topTags(de.tgw)
summary(decideTestsDGE(de.tgw, p.value=0.01))
summary(decideTestsDGE(de.tgw, p.value=0.05))
bob = summary(decideTestsDGE(de.tgw,p.value=0.05))
total = bob[1] + bob[3]

sink("de_tagwise_dispersion.txt")
topTags(de.tgw,n=total)
sink()

baySeq

library(baySeq)
data <- read.delim("pnas_expression_filtered.tsv",header=T)
replicates <- c(1,1,1,1,2,2,2)
groups <- list(NDE = c(1,1,1,1,1,1,1),DE = c(1,1,1,1,2,2,2))
cname <- data[,1]
data <- data[,-1]
data = as.matrix(data)
CD <- new("countData", data = data, replicates = replicates, groups = groups)
CD@libsizes <- getLibsizes(CD)
plotMA.CD(CD, samplesA = "1", samplesB = "2", col = c(rep("red", 100), rep("black", 900)))
library(snow)
cl <- makeCluster(8, "SOCK")
CD@annotation <- as.data.frame(cname)
CD <- getPriors.NB(CD, samplesize = 10000, estimation = "QL", cl = cl)
CD <- getLikelihoods.NB(CD, pET = 'BIC', cl = cl)
CD@estProps
topCounts(CD, group = "DE")
blah <- topCounts(CD,group="DE",FDR=0.05)
write.csv(blah, file="bayseq_de.csv")

Results

Number of statistically significant (adjusted p-value of 0.05 or less) differentially expressed genes

DESeq: 2,796
edgeR (tagwise): 4,453
baySeq: 2,847

To compare results, I wrote this Perl script:

#!/usr/bin/perl

#Compares two lists and reports entries in both files

use strict;
use warnings;
my $usage = "Usage $0 <firstList> <secondList> <in or out>\n";
my $first = shift or die $usage;
my $second = shift or die $usage;
my $option = shift or die $usage;

die $usage unless $option =~ /[inout]+/i;

my %firstList = ();
my %secondList = ();

open (IN, $first) or die "Can't open $first\n";
while (<IN>){
   chomp;
   $firstList{$_} = $_;
}
close (IN);

open (IN, $second) or die "Can't open $second\n";
while (<IN>){
   chomp;
   $secondList{$_} = $_;
}
close (IN);

if ($option =~ /^[in]+/i){
   for my $blah (keys %firstList){
      if (exists $secondList{$blah}){
         print "$blah is in both lists\n";
      }
   }
} elsif ($option =~ /^[out]+/i){
   foreach my $first (keys %firstList){
      print "<$first\n" unless exists $secondList{$first};
   }
   foreach my $second (keys %secondList){
      print ">$second\n" unless exists $firstList{$second};
   }
}

Store the top 100 DE genes from the respective packages:

parse_deseq.pl results_deseq.txt | cut -f2 -d ' ' | sed 's/"//g;' | head -100 > deseq.list
cat de_tagwise_dispersion.txt | grep -v "Comparison" | grep -v "logFC" | awk '{print $1}' | head -100 > edger.list
cat bayseq_de.csv | grep -v "cname" | cut -f2 -d',' | sed 's/"//g' | head -100 > bayseq.list

Comparing the top 100 differentially expressed genes

compare_list.pl deseq.list edger.list in | wc -l
88
compare_list.pl deseq.list bayseq.list in | wc -l
65
compare_list.pl edger.list bayseq.list in | wc -l
60

When comparing the top 100 (ranked by adjusted p-values) list from DESeq and edgeR, we see an overlap of 88 / 100 of the candidates. DESeq and baySeq had an overlap of 65 / 100 and edgeR and baySeq had an overlap of 60 / 100.

Store the top 500 DE genes from the respective packages:

parse_deseq.pl results_deseq.txt | cut -f2 -d ' ' | sed 's/"//g;' | head -500 > deseq_500.list
cat de_tagwise_dispersion.txt | grep -v "Comparison" | grep -v "logFC" | awk '{print $1}' | head -500 > edger_500.list
cat bayseq_de.csv | grep -v "cname" | cut -f2 -d',' | sed 's/"//g' | head -500 > bayseq_500.list

Comparing the top 500 differentially expressed genes

compare_list.pl deseq_500.list edger_500.list in | wc -l
450
compare_list.pl deseq_500.list bayseq_500.list in | wc -l
424
compare_list.pl edger_500.list bayseq_500.list in | wc -l
440

When comparing the top 500 (ranked by adjusted p-values) list from DESeq and edgeR, we see an overlap of 450 / 500 of the candidates. DESeq and baySeq had an overlap of 424 / 500 and edgeR and baySeq had an overlap of 440 / 500.

For comparison with a non-parametric method, see this post.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
17 comments Add yours
  1. This is a great source of information! Have you thought about adding this as a tutorial on the re-released Biostars forum?

    1. Hi Ian,

      I haven’t really thought about it but if I were to include it as a tutorial, I think I need to greatly expand on the statistical intrepretation of the results. And to be honest, I don’t feel qualified. But thanks for the nice compliment!

      Dave

  2. Dave, have you though of adding voom (now part of limma) to this comparison? i’m currently trying to wrangle some bacterial RNAseq data, and although I can get good separation (based on a plotMDS plot), my results are really not what I expected (ie, I cant answer the biological question i know is there in the data I have), and all packages seem to give me huge numbers of differentially expressed genes.

    1. Hey Jason!

      I have not used limma since my impression was that limma can only be used for homoscedastic data (e.g. microarrays), whereas count data is heteroscedastic. But perhaps voom deals with this, so I’ll have a look at it some time.

      How consistent are your results from the various packages? I guess if there is some consistency, then perhaps they are reliable.

      Since all these methods are parametric, you can try a non-parametric method such as SAMSeq (http://davetang.org/muse/2012/01/25/edger-vs-samseq/) or cuffdiff (something I intend to look at later).

      Good luck!

      Dave

  3. I recently came across these posts, you have some interesting approaches (and I learned about pvclust), have you tried any PCA based approaches (something like sparse pca) for NGS (RNA seq read count) analysis ?

    1. Hi Sudeep,

      I’ve only tried traditional PCA (using prcomp() in R). I’m learning new things along the way and perhaps sometime later I may come across PCA based approaches.

      Cheers,

      Dave

    1. Thanks! Please do note that this post is getting quite old; hopefully I find some time to update it soon.

  4. Dave, this post is very helpful, I learned much from it, though I can’t seem to get the edgeR get to work.

    This line is giving me an error :
    plot(log2(d$conc$conc.common),d$tagwise.dispersion,panel.first=grid())

    Error in log2(d$conc$conc.common) :
    Non-numeric argument to mathematical function

    Someone has any idea ?

  5. Hi Dave
    The post is useful
    But when I ran
    “parse_deseq.pl results_deseq.txt | cut -f2 -d ‘ ‘ | sed ‘s/”//g;’ | head -100 > deseq.list”
    I have the error,
    “Use of uninitialized value $padj in numeric le (<=) at parse_deseq1.pl line 22, line 1115”

    My dataset “results_deseq.txt”
    “”,”id”,”baseMean”,”baseMeanA”,”baseMeanB”,”foldChange”,”log2FoldChange”,”pval”,”padj”
    “1”,”FBgn0000003″,0,0,0,NA,NA,NA,NA
    “2”,”FBgn0000008″,592.702193588692,603.200284272345,582.204102905039,0.965192023421152,-0.0511121020629761,0.69403015318515,1
    “3”,”FBgn0000014″,96.127484977601,83.5990305258537,108.655939429348,1.29972726652309,0.378208921319469,0.221523493353752,1

    I don’t why I get this bug?
    Thank you!

    1. Hello,

      I think what’s causing the error is that the script is splitting the lines by spaces, but your file is comma delimited. Change line 22 to this:

      my ($junk,$id,$base_mean,$base_mean_a,$base_mean_b,$fold_change,$log_2_fold_change,$pval,$padj) = split(/,/);

      Cheers,

      Dave

      1. Thank you Dave. Now the error
        Argument “NA” isn’t numeric in numeric le (<=) at parse_deseq1.pl line 22
        Do I need to preprocess to remove these genes cos the fold change is 0.

        1. The error is caused because the script expects the adjusted p-value to be a number, so it complains when it comes across the NA string. Add this line of code to skip the NAs:

          next if $padj eq 'NA';

          You can add it just above this line:

          next unless $padj <= 0.05;

          1. Sorry Still have the error

            Argument “”padj”” isn’t numeric in numeric le (<=) at parse_deseq1.pl line 23, line 1.

            Sorry I am not familiar with Perl.Thank you!

            1. Third time lucky:

              my ($junk,$id,$base_mean,$base_mean_a,$base_mean_b,$fold_change,$log_2_fold_change,$pval,$padj) = split(/\s+/);
              next unless $padj =~ /^[0-9.]+$/;
              next unless $padj <= 0.05;

              It will skip lines when $padj isn't made up of digits and periods.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.