edgeR vs. DESeq using pnas_expression.txt

Firstly from Davis's homepage download the file pnas_expression.txt. For more information on the dataset please refer to the edgeR manual and this paper.

The latest R version at the time of writing is R 2.13.1. You can download it from here.

Install bioconductor and the required packages:

source(“http://www.bioconductor.org/biocLite.R”)
biocLite()
biocLite(“DESeq”)
biocLite(“edgeR”)

A filtering criteria of keeping "only those tags that have at least one count per million in at least three samples" was specified in the edgeR manual. As I'm no good with R, I will preprocess the pnas_expression.txt file with Perl imposing the same filter and also removing the "len" column.

#!/usr/bin/perl

use strict;
use warnings;

my $infile = 'pnas_expression.txt';

open(IN,'<',$infile) || die "Could not open $infile: $!\n";

my @library_size = ();

while(<IN>){
   chomp;
   next if /^ensembl_ID/;
   #ensembl_ID      lane1   lane2   lane3   lane4   lane5   lane6   lane8   len
   #ENSG00000215696 0       0       0       0       0       0       0       330
   my @line_split = split(/\t/);
   for (my $i = 1; $i<8; ++$i){
      $library_size[$i] += $line_split[$i];
   }
}
close(IN);

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my @line_split = split(/\t/);
   pop(@line_split);
   if (/^ensembl_ID/){
      print join("\t",@line_split),"\n";
      next;
   }
   my $checkpoint = '0';
   for (my $i = 1; $i<8; ++$i){
      my $tpm = $line_split[$i] * 1_000_000 / $library_size[$i];
      ++$checkpoint if $tpm >= 1;
   }
   print join ("\t",@line_split),"\n" if $checkpoint >= 3;
}
close(IN);

exit(0);

__END__

filter.pl > pnas_expression_filtered.tsv

library(edgeR)
raw.data <- read.delim("pnas_expression_filtered.tsv")
d <- raw.data[,2:8]
rownames(d) <- raw.data[,1]
group <- c(rep("Control",4),rep("DHT",3))
d <- DGEList(counts = d, group=group)
dim(d)
[1] 16494     7
d <- calcNormFactors(d)
d
An object of class "DGEList"
$samples
        group lib.size norm.factors
lane1 Control   976847    1.0296636
lane2 Control  1154746    1.0372521
lane3 Control  1439393    1.0362662
lane4 Control  1482652    1.0378383
lane5     DHT  1820628    0.9537095
lane6     DHT  1831553    0.9525624
lane8     DHT   680798    0.9583181

$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000124208   478   619   628   744   483   716   240
ENSG00000182463    27    20    27    26    48    55    24
ENSG00000124201   180   218   293   275   373   301    88
ENSG00000124207    76    80    85    97    80    81    37
ENSG00000125835   132   200   200   228   280   204    52
16489 more rows ...

$all.zeros
ENSG00000124208 ENSG00000182463 ENSG00000124201 ENSG00000124207 ENSG00000125835 
          FALSE           FALSE           FALSE           FALSE           FALSE 
16489 more elements ...

plotMDS.dge(d, main = "MDS Plot for Li Data", xlim = c(-1, 1),labels = c("Control1","Control2","Control3","Control4","DHT1", "DHT2", "DHT3"))

On dimension 2, DHT3 is quite different from the other DHT samples. It should be noted that the library size of DHT3 is three times smaller than DHT1 and DHT2.

Meanwhile in DESeq:

library("DESeq")
countsTable <- read.delim("pnas_expression_filtered.tsv", header=TRUE, stringsAsFactors=TRUE)
rownames(countsTable) <- countsTable$ensembl_ID
countsTable <- countsTable[,-1]
conds <- c("N","N","N","N","T","T","T")
cds <- newCountDataSet (countsTable, conds)
cds <- estimateSizeFactors(cds)
sizeFactors(cds)
    lane1     lane2     lane3     lane4     lane5     lane6     lane8 
0.7911762 0.9433534 1.1885656 1.2310312 1.4100653 1.4225280 0.5140726

For more information on how DESeq calculates the size factors, see this post. I attempted to calculate these values manually by following their instructions and I got 0.726441658 0.87339311 1.103958663 1.141646857 1.276470588 1.285666056 0.449085824. Slightly off, perhaps due to Excel or more likely due to something I did wrong.

On the other hand, edgeR has these normalisation factors:

group lib.size norm.factors
lane1 Control 976847 1.0296636
lane2 Control 1154746 1.0372521
lane3 Control 1439393 1.0362662
lane4 Control 1482652 1.0378383
lane5 DHT 1820628 0.9537095
lane6 DHT 1831553 0.9525624
lane8 DHT 680798 0.9583181

See this post for more information on the norm.factors.

Before calling differential expression on any of the genes, it is important to estimate the variance. Both methods use the negative binomial distribution to model the data as the level of noise is beyond what the Poisson model can account for due to biological variance.

Using edgeR to calculate the variance

d <- estimateCommonDisp(d)
d$samples$lib.size
[1]  976847 1154746 1439393 1482652 1820628 1831553  680798
d$common.lib.size
[1] 1274589
colSums(d$pseudo.alt)
  lane1   lane2   lane3   lane4   lane5   lane6   lane8 
1237946 1228854 1230106 1228297 1336795 1338335 1331880 
d$common.dispersion
[1] 0.01999041
sqrt(d$common.dispersion)
[1] 0.1413875

The common dispersion is the "squared coefficient of variation", where the coefficient of variation gives the amount of variability in the true abundance for each gene between replicates. The common dispersion estimate is ~0.02, which (take square root) gives an estimate of the coefficient of variation of 0.14. The true abundance for each gene can vary up or down by 14% between replicates.

Using DESeq:

cds <- estimateVarianceFunctions(cds)
scvPlot(cs,ylim=c(0,2))

The x axis is the base mean, calculated using the size factors and the y axis is the squared coefficient of variation, the ratio of the variance at base level to the square of the base mean. The solid lines are the SCV for the raw variances. The dotted line is the base variance, calculated using the size factors. The solid black line is a density estimate of the base means; only where there is an appreciable number of base mean values, the variance estimates can be expected to be accurate. Observe the decrease in variance as the base mean of the genes increases.

diagForT <-  varianceFitDiagnostics(cds,"T")
head(diagForT)
                 baseMean    baseVar fittedRawVar fittedBaseVar    pchisq
ENSG00000124208 437.57559 7106.70216   2133.97917    2623.68628 0.9333747
ENSG00000182463  39.79685   40.93749     50.62900      95.16714 0.3495977
ENSG00000124201 215.76798 2191.36827    668.79704     910.27101 0.9099491
ENSG00000124207  61.88337   76.38028     96.49555     165.75153 0.3692287
ENSG00000125835 147.71069 2386.52531    366.69439     532.00292 0.9887335
ENSG00000125834  73.61848  310.50106    125.27327     207.66245 0.7758007
smoothScatter (log10(diagForT$baseMean), log10(diagForT$baseVar))
lines (log10(fittedBaseVar) ~ log10(baseMean), diagForT[order(diagForT$baseMean),],col="red")
residualsEcdfPlot(cds,"T")
residualsEcdfPlot(cds,"N")

The fitted base variance is the predicted value from the local fit through the base variance estimates from all genes.

The base mean is proportional to the base variance until there is high biological variability with abundantly expressed genes.

The pchisq column from the varianceFitDiagnostics() function is used to determine whether single-gene estimate of the base variance deviates too much from the fitted value (which is calculated by taking the variance from all genes into account). The above plots show an empirical cumulative density function (ECDF) of the pchisq values stratified by base level. Basically if the curves are above the green line, where variance is overestimated, there is nothing to worry about.

Calling DE using edgeR and using a common dispersion

de.com <- exactTest(d)
Comparison of groups:  DHT - Control 
topTags(de.com)
Comparison of groups: DHT-Control 
                  logConc    logFC       P.Value           FDR
ENSG00000151503 -11.93733 5.821695 6.831306e-193 1.126756e-188
ENSG00000096060 -11.32227 5.009483 1.311104e-162 1.081267e-158
ENSG00000127954 -15.62219 8.235552 2.461375e-153 1.353264e-149
ENSG00000166451 -12.27691 4.686616 1.108080e-134 4.569166e-131
ENSG00000131016 -14.41810 5.306943 4.156214e-110 1.371052e-106
ENSG00000113594 -12.82287 4.116740 5.371074e-102  1.476508e-98
ENSG00000116285 -13.55664 4.204769  4.673617e-93  1.101238e-89
ENSG00000123983 -12.08499 3.660756  1.263258e-92  2.604521e-89
ENSG00000166086 -15.23487 5.507542  1.925054e-90  3.527982e-87
ENSG00000162772 -10.80630 3.318369  7.379002e-86  1.217093e-82
summary(decideTestsDGE(de.com, p.value=0.01))
   [,1] 
-1  1621
0  13110
1   1763
summary(decideTestsDGE(de.com, p.value=0.05))
   [,1] 
-1  2467
0  11555
1   2472

At an adjusted p-value of 0.05, 4,939 are differentially expressed.

d <- estimateTagwiseDisp(d,prior.n=10,trend=TRUE, prop.used=0.3, grid.length = 500)
summary(d$tagwise.dispersion)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.009082 0.021450 0.040580 0.062720 0.097690 0.383100
plot(log2(d$conc$conc.common), d$tagwise.dispersion, panel.first = grid())
abline(h=d$common.dispersion, col="dodgerblue",lwd=3)

Dispersion decreases with tag concentration.

de.tgw <- exactTest (d,common.disp=FALSE)
Comparison of groups:  DHT - Control 
topTags(de.tgw)
Comparison of groups: DHT-Control 
                   logConc    logFC       P.Value           FDR
ENSG00000151503 -11.936602 5.820621 6.372940e-312 1.051153e-307
ENSG00000096060 -11.321444 5.007775 3.937429e-270 3.247198e-266
ENSG00000166451 -12.278413 4.686846 1.314101e-213 7.224927e-210
ENSG00000127954 -15.621048 8.234180 1.513030e-191 6.238980e-188
ENSG00000113594 -12.824797 4.115144 4.910925e-156 1.620016e-152
ENSG00000162772 -10.805702 3.317800 6.173781e-145 1.697172e-141
ENSG00000123983 -12.085611 3.657763 1.283904e-129 3.025245e-126
ENSG00000116133 -11.730415 3.245004 1.530499e-126 3.155506e-123
ENSG00000115648  -8.820711 2.598023 1.686581e-124 3.090942e-121
ENSG00000116285 -13.556232 4.206675 2.656142e-122 4.381040e-119
summary(decideTestsDGE(de.tgw,p.value=0.05))
   [,1] 
-1  2124
0  12054
1   2316
mv <- plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,dispersion.method="qcml",NBline=TRUE)

See smooth scatter plot above.

Calling differential expression using DESeq

res <- nbinomTest( cds,"N","T")
resSig <- res[res$padj < 0.05,]
nrow(resSig)
[1] 4965

Taking the top 100 differentially expressed genes (ordered by p-value) from both lists there are only 88 overlaps. Taking the top 500, there are 450 overlaps.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
9 comments Add yours
  1. Looks like in edgeR when calling significantly DE genes “summary(decideTestsDGE(de.com, p.value=0.05))” you are using the raw p.value, not the adjusted p.value. You’d need to add: adjust.method=”BH” or similar to get the adjusted val.

      1. Hi Seth,

        Thanks for the comments. It’s good to point that it is the default just in case others were wondering too. So no need to ignore 🙂

        Cheers,

        Dave

  2. Hello, the information is very helpful. I’m very curious why the overlap is so small of the top ranked p.values between DESeq and edgeR results. May I have your ideas?

    Besides, what is d$pseudo.alt? Is it the quantile normalized data? How can we tell, at each step when we input DEGlist d, which count data does the algorithm use? For example, does it use original raw read count, or pseudo.alt?

    1. Hi Yuan,

      Thanks for the comment. I just redid the analysis using R version 2.14.1, edgeR_2.4.1, DESeq_1.6.1 and Bioconductor version 2.9.

      The overlap between the top 100 genes was 88/100 and for the top 500 genes it was 450/500. I probably did not sort the DESeq output by the adjusted p-value in my original analysis (now that I realise the output is not sorted by default unlike edgeR). I will modify the post.

      From the edgeR manual,

      The element common.dispersion, as the name suggests, provides the estimate of the common dispersion, and pseudo.alt gives the pseudocounts calculated under the alternative hypothesis.

      The element common.lib.size gives the library size to which the original libraries have been adjusted in the pseudocounts.

      If I do not run:

      d <- calcNormFactors(d) The output becomes: > d$common.lib.size
      [1] 1274589
      > colSums(d$pseudo.alt)
      lane1 lane2 lane3 lane4 lane5 lane6 lane8
      1274905 1274630 1274702 1274787 1274935 1274798 1275881

      The manual is much more informative than my blog will ever be. If you still have questions, you should email bioconductor at r-project.org, and most likely get a reply from the authors of the packages.

      Cheers,

      Dave

  3. Do you happen to know how to get the genes that contribute most to the two axes out from the MDS?

    Cheers,

    Seth

    1. Hi Seth,

      For the MDS plot I don’t know. However you could perform a PCA using R, then have a look at the scores of each feature at the different components, for example:


      data <- read.table("some.file") data.pca <- prcomp(data) data.pca$x

      I don't know the difference between a MDS analysis versus a PCA, but they seem so be similar.

      Hope that helps,

      Dave

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.