edgeR vs. DESeq using pnas_expression.txt

First, download the file pnas_expression.txt from Davis's homepage. For more information on the dataset, please refer to the edgeR manual and this paper.

The latest R version at the time of writing is R 2.13.1. You can download it from here.

Install bioconductor and the required packages:

source("http://www.bioconductor.org/biocLite.R")
biocLite()
biocLite("DESeq")
biocLite("edgeR")

A filtering criterion of keeping "only those tags that have at least one count per million in at least three samples" is specified in the edgeR manual. As I'm no good with R, I will preprocess the pnas_expression.txt file with Perl, imposing the same filter and also removing the "len" column.

#!/usr/bin/perl

use strict;
use warnings;

my $infile = 'pnas_expression.txt';

open(IN,'<',$infile) || die "Could not open $infile: $!\n";

my @library_size = ();

while(<IN>){
   chomp;
   next if /^ensembl_ID/;
   #ensembl_ID      lane1   lane2   lane3   lane4   lane5   lane6   lane8   len
   #ENSG00000215696 0       0       0       0       0       0       0       330
   my @line_split = split(/\t/);
   for (my $i = 1; $i<8; ++$i){
      $library_size[$i] += $line_split[$i];
   }
}
close(IN);

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my @line_split = split(/\t/);
   pop(@line_split);
   if (/^ensembl_ID/){
      print join("\t",@line_split),"\n";
      next;
   }
   my $checkpoint = 0;
   for (my $i = 1; $i<8; ++$i){
      #counts per million (CPM) for this tag in this sample
      my $cpm = $line_split[$i] * 1_000_000 / $library_size[$i];
      ++$checkpoint if $cpm >= 1;
   }
   print join ("\t",@line_split),"\n" if $checkpoint >= 3;
}
close(IN);

exit(0);

__END__

perl filter.pl > pnas_expression_filtered.tsv
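The same filter can also be expressed directly in R; here's a minimal sketch (the function name cpm_filter is my own, not part of edgeR):

```r
#keep rows with >= min_cpm counts per million in >= min_samples samples
cpm_filter <- function(counts, min_cpm = 1, min_samples = 3) {
   counts <- as.matrix(counts)
   #counts per million: scale each column by its library size
   cpm <- t(t(counts) / colSums(counts)) * 1e6
   counts[rowSums(cpm >= min_cpm) >= min_samples, , drop = FALSE]
}

#e.g. cpm_filter(raw.data[, 2:8]) should reproduce the Perl filter above
```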

library(edgeR)
raw.data <- read.delim("pnas_expression_filtered.tsv")
d <- raw.data[,2:8]
rownames(d) <- raw.data[,1]
group <- c(rep("Control",4),rep("DHT",3))
d <- DGEList(counts = d, group=group)
dim(d)
[1] 16494     7
d <- calcNormFactors(d)
d
An object of class "DGEList"
$samples
        group lib.size norm.factors
lane1 Control   976847    1.0296636
lane2 Control  1154746    1.0372521
lane3 Control  1439393    1.0362662
lane4 Control  1482652    1.0378383
lane5     DHT  1820628    0.9537095
lane6     DHT  1831553    0.9525624
lane8     DHT   680798    0.9583181

$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000124208   478   619   628   744   483   716   240
ENSG00000182463    27    20    27    26    48    55    24
ENSG00000124201   180   218   293   275   373   301    88
ENSG00000124207    76    80    85    97    80    81    37
ENSG00000125835   132   200   200   228   280   204    52
16489 more rows ...

$all.zeros
ENSG00000124208 ENSG00000182463 ENSG00000124201 ENSG00000124207 ENSG00000125835 
          FALSE           FALSE           FALSE           FALSE           FALSE 
16489 more elements ...

plotMDS.dge(d, main = "MDS Plot for Li Data", xlim = c(-1, 1),labels = c("Control1","Control2","Control3","Control4","DHT1", "DHT2", "DHT3"))

On dimension 2, DHT3 is quite different from the other DHT samples. It should be noted that the library size of DHT3 is roughly a third of that of DHT1 and DHT2.

Meanwhile in DESeq:

library("DESeq")
countsTable <- read.delim("pnas_expression_filtered.tsv", header=TRUE, stringsAsFactors=TRUE)
rownames(countsTable) <- countsTable$ensembl_ID
countsTable <- countsTable[,-1]
conds <- c("N","N","N","N","T","T","T")
cds <- newCountDataSet (countsTable, conds)
cds <- estimateSizeFactors(cds)
sizeFactors(cds)
    lane1     lane2     lane3     lane4     lane5     lane6     lane8 
0.7911762 0.9433534 1.1885656 1.2310312 1.4100653 1.4225280 0.5140726

For more information on how DESeq calculates the size factors, see this post. I attempted to calculate these values manually by following their instructions and I got 0.726441658 0.87339311 1.103958663 1.141646857 1.276470588 1.285666056 0.449085824. Slightly off, perhaps due to Excel or more likely due to something I did wrong.
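The calculation is simple enough to sketch in R (this is my own re-implementation of the median-of-ratios procedure, not DESeq's code): take the geometric mean of each gene across samples, divide each sample's counts by those means, and take the median of the ratios per sample.

```r
#manual DESeq-style size factors: median of per-gene ratios to the
#geometric mean, computed on the log scale for numerical stability
size_factors <- function(counts) {
   log_geo_means <- rowMeans(log(counts))   #log geometric mean per gene
   apply(counts, 2, function(col) {
      #genes with a zero count have -Inf geometric means; drop them
      exp(median((log(col) - log_geo_means)[is.finite(log_geo_means)]))
   })
}
```

Running this on countsTable should closely reproduce sizeFactors(cds) above, which suggests my Excel attempt went wrong somewhere.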

On the other hand, edgeR has these normalisation factors:

group lib.size norm.factors
lane1 Control 976847 1.0296636
lane2 Control 1154746 1.0372521
lane3 Control 1439393 1.0362662
lane4 Control 1482652 1.0378383
lane5 DHT 1820628 0.9537095
lane6 DHT 1831553 0.9525624
lane8 DHT 680798 0.9583181

See this post for more information on the norm.factors.

Before calling differential expression on any of the genes, it is important to estimate the variance. Both methods model the data with the negative binomial distribution, because biological variability pushes the noise beyond what a Poisson model can account for.
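To see why the Poisson is insufficient, compare simulated counts: a Poisson variable's variance equals its mean, while a negative binomial variable adds an overdispersion term (a sketch with arbitrary parameters):

```r
set.seed(123)
mu <- 100
#Poisson: variance equals the mean
pois <- rpois(100000, lambda = mu)
#negative binomial: variance = mu + mu^2/size
nb <- rnbinom(100000, size = 10, mu = mu)
var(pois)  #close to 100
var(nb)    #close to 100 + 100^2/10 = 1100
```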

Using edgeR to calculate the variance

d <- estimateCommonDisp(d)
d$samples$lib.size
[1]  976847 1154746 1439393 1482652 1820628 1831553  680798
d$common.lib.size
[1] 1274589
colSums(d$pseudo.alt)
  lane1   lane2   lane3   lane4   lane5   lane6   lane8 
1237946 1228854 1230106 1228297 1336795 1338335 1331880 
d$common.dispersion
[1] 0.01999041
sqrt(d$common.dispersion)
[1] 0.1413875

The common dispersion is the "squared coefficient of variation", where the coefficient of variation gives the amount of variability in the true abundance of each gene between replicates. The common dispersion estimate is ~0.02, whose square root gives an estimated coefficient of variation of 0.14: the true abundance of each gene can vary up or down by roughly 14% between replicates.
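This interpretation is easy to check by simulation: if counts follow a negative binomial with dispersion phi, the coefficient of variation of a highly expressed gene approaches sqrt(phi) (a sketch with made-up numbers):

```r
set.seed(31)
phi <- 0.02                 #dispersion, as estimated above
mu  <- 10000                #a highly expressed gene
#in rnbinom's mean parametrisation, size = 1/dispersion
counts <- rnbinom(100000, size = 1/phi, mu = mu)
sd(counts) / mean(counts)   #close to sqrt(0.02) ~ 0.14
```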

Using DESeq:

cds <- estimateVarianceFunctions(cds)
scvPlot(cds,ylim=c(0,2))

The x axis is the base mean, calculated using the size factors, and the y axis is the squared coefficient of variation: the ratio of the variance at base level to the square of the base mean. The solid lines are the SCVs for the raw variances, and the dotted line is the base variance, calculated using the size factors. The solid black line is a density estimate of the base means; the variance estimates can only be expected to be accurate where there is an appreciable number of base mean values. Observe the decrease in variance as the base mean of the genes increases.

diagForT <-  varianceFitDiagnostics(cds,"T")
head(diagForT)
                 baseMean    baseVar fittedRawVar fittedBaseVar    pchisq
ENSG00000124208 437.57559 7106.70216   2133.97917    2623.68628 0.9333747
ENSG00000182463  39.79685   40.93749     50.62900      95.16714 0.3495977
ENSG00000124201 215.76798 2191.36827    668.79704     910.27101 0.9099491
ENSG00000124207  61.88337   76.38028     96.49555     165.75153 0.3692287
ENSG00000125835 147.71069 2386.52531    366.69439     532.00292 0.9887335
ENSG00000125834  73.61848  310.50106    125.27327     207.66245 0.7758007
smoothScatter (log10(diagForT$baseMean), log10(diagForT$baseVar))
lines (log10(fittedBaseVar) ~ log10(baseMean), diagForT[order(diagForT$baseMean),],col="red")
residualsEcdfPlot(cds,"T")
residualsEcdfPlot(cds,"N")

The fitted base variance is the predicted value from the local fit through the base variance estimates from all genes.

The base variance is roughly proportional to the base mean until the genes are abundantly expressed, where high biological variability takes over.

The pchisq column from the varianceFitDiagnostics() function is used to determine whether a single-gene estimate of the base variance deviates too much from the fitted value (which is calculated by taking the variance of all genes into account). The above plots show the empirical cumulative distribution function (ECDF) of the pchisq values, stratified by base level. If the curves stay above the green line, i.e., the variance is overestimated, there is nothing to worry about.

Calling differential expression using edgeR with the common dispersion

de.com <- exactTest(d)
Comparison of groups:  DHT - Control 
topTags(de.com)
Comparison of groups: DHT-Control 
                  logConc    logFC       P.Value           FDR
ENSG00000151503 -11.93733 5.821695 6.831306e-193 1.126756e-188
ENSG00000096060 -11.32227 5.009483 1.311104e-162 1.081267e-158
ENSG00000127954 -15.62219 8.235552 2.461375e-153 1.353264e-149
ENSG00000166451 -12.27691 4.686616 1.108080e-134 4.569166e-131
ENSG00000131016 -14.41810 5.306943 4.156214e-110 1.371052e-106
ENSG00000113594 -12.82287 4.116740 5.371074e-102  1.476508e-98
ENSG00000116285 -13.55664 4.204769  4.673617e-93  1.101238e-89
ENSG00000123983 -12.08499 3.660756  1.263258e-92  2.604521e-89
ENSG00000166086 -15.23487 5.507542  1.925054e-90  3.527982e-87
ENSG00000162772 -10.80630 3.318369  7.379002e-86  1.217093e-82
summary(decideTestsDGE(de.com, p.value=0.01))
   [,1] 
-1  1621
0  13110
1   1763
summary(decideTestsDGE(de.com, p.value=0.05))
   [,1] 
-1  2467
0  11555
1   2472

At an adjusted p-value of 0.05, 4,939 genes (2,467 down- and 2,472 up-regulated) are differentially expressed.

d <- estimateTagwiseDisp(d,prior.n=10,trend=TRUE, prop.used=0.3, grid.length = 500)
summary(d$tagwise.dispersion)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.009082 0.021450 0.040580 0.062720 0.097690 0.383100
plot(log2(d$conc$conc.common), d$tagwise.dispersion, panel.first = grid())
abline(h=d$common.dispersion, col="dodgerblue",lwd=3)

Dispersion decreases with tag concentration.

de.tgw <- exactTest (d,common.disp=FALSE)
Comparison of groups:  DHT - Control 
topTags(de.tgw)
Comparison of groups: DHT-Control 
                   logConc    logFC       P.Value           FDR
ENSG00000151503 -11.936602 5.820621 6.372940e-312 1.051153e-307
ENSG00000096060 -11.321444 5.007775 3.937429e-270 3.247198e-266
ENSG00000166451 -12.278413 4.686846 1.314101e-213 7.224927e-210
ENSG00000127954 -15.621048 8.234180 1.513030e-191 6.238980e-188
ENSG00000113594 -12.824797 4.115144 4.910925e-156 1.620016e-152
ENSG00000162772 -10.805702 3.317800 6.173781e-145 1.697172e-141
ENSG00000123983 -12.085611 3.657763 1.283904e-129 3.025245e-126
ENSG00000116133 -11.730415 3.245004 1.530499e-126 3.155506e-123
ENSG00000115648  -8.820711 2.598023 1.686581e-124 3.090942e-121
ENSG00000116285 -13.556232 4.206675 2.656142e-122 4.381040e-119
summary(decideTestsDGE(de.tgw,p.value=0.05))
   [,1] 
-1  2124
0  12054
1   2316
mv <- plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,dispersion.method="qcml",NBline=TRUE)

See smooth scatter plot above.

Calling differential expression using DESeq

res <- nbinomTest( cds,"N","T")
resSig <- res[res$padj < 0.05,]
nrow(resSig)
[1] 4965

Taking the top 100 differentially expressed genes (ordered by p-value) from both lists, only 88 overlap; taking the top 500, 450 overlap.
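The overlap counts were computed along these lines (top_n_overlap is my own helper; the commented usage assumes the de.tgw and res objects from above, with edgeR and DESeq p-values as named vectors):

```r
#number of shared identifiers among the top n of two ranked p-value lists
top_n_overlap <- function(p1, p2, n) {
   length(intersect(names(sort(p1))[1:n], names(sort(p2))[1:n]))
}

#with the objects from above, something like:
#edger.p <- setNames(de.tgw$table$p.value, rownames(de.tgw$table))
#deseq.p <- setNames(res$pval, res$id)
#top_n_overlap(edger.p, deseq.p, 100)
```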

Pearson vs. Spearman correlation

Correlation measures are commonly used to show how correlated two datasets are. A commonly used measure is the Pearson correlation. To illustrate when not to use it:

x = c(55,70,33,100,99,15,2,1,5,2000)
y = c(2,10,88,20,30,88,23,49,40,2000)

cor(x,y,method="pearson")
[1] 0.9957008
cor(x,y,method="spearman")
[1] -0.07294867
cor(log(x),log(y),method="pearson")
[1] 0.3556905

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    7.50   44.00  238.00   91.75 2000.00
summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00   20.75   35.00  235.00   78.25 2000.00

sd(x)
[1] 620.2714
sd(y)
[1] 620.8571

If we remove the 2,000 value:

x1 = c(55,70,33,100,99,15,2,1,5)
y1 = c(2,10,88,20,30,88,23,49,40)
cor(x1,y1,method="pearson")
[1] -0.4440288
cor(x1,y1,method="spearman")
[1] -0.4769916

Use a non-parametric correlation measure (e.g. Spearman's rank) if your dataset has outliers. Better still, remove the outlier: doing so reveals the underlying negative correlation in both measures.
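Spearman's correlation is simply Pearson's correlation computed on the ranks, which is why a single extreme value cannot dominate it:

```r
x <- c(55, 70, 33, 100, 99, 15, 2, 1, 5, 2000)
y <- c(2, 10, 88, 20, 30, 88, 23, 49, 40, 2000)
#identical results: ranking caps the outlier's influence
cor(x, y, method = "spearman")
cor(rank(x), rank(y), method = "pearson")
```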

edgeR's common dispersion

Updated: 2014 April 18th

This post shows edgeR's common dispersion values for counts generated from various distributions. For an explanation of the common dispersion value in edgeR, please refer to this thread.

The generated values represent the counts of reads, i.e., expression level, for a gene. Each row will represent one gene and I will generate counts for 20,000 genes. Each column represents a sample and I will have 4 samples. Therefore the common dispersion calculations will be based on a matrix that has 20,000 rows and 4 columns.

Firstly, let's generate the average expression of all genes using the negative binomial distribution:

#negative binomial distribution of mean expression
#n    = number of observations
#size = target for number of successful trials
#prob = probability of success in each trial
set.seed(31)
mean <- rnbinom(n=20000, size=1, prob=0.0001)

#on average it takes ~10,000 trials to get one success
#this makes sense because I set the probability to 0.0001
summary(mean)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      1    2803    6915    9898   13780  106300

plot(density(mean))

This distribution is meant to represent the fact that few genes are very highly expressed.

What's the common dispersion when the expression of every gene is identical in 4 samples?

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite('edgeR')

#load library
library(edgeR)

#identical expression
identical <- matrix(rep(x=mean, each=4),
                    nrow=length(mean),
                    byrow=T)

head(identical)
#      [,1]  [,2]  [,3]  [,4]
#[1,]  5386  5386  5386  5386
#[2,]  4282  4282  4282  4282
#[3,]  2243  2243  2243  2243
#[4,] 12830 12830 12830 12830
#[5,] 11962 11962 11962 11962
#[6,] 24473 24473 24473 24473

group <- c(rep("Control",4))
d <- DGEList(counts =  identical, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)

d$common.dispersion
#[1] 0.0001005378

summary(d$tagwise.dispersion)
#     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#1.571e-06 1.571e-06 1.571e-06 1.571e-06 1.571e-06 1.571e-06

How about counts following a Poisson distribution:

set.seed(31)
poisson <- t(sapply(mean, rpois, n=4))
head(poisson)
#      [,1]  [,2]  [,3]  [,4]
#[1,]  5390  5372  5377  5496
#[2,]  4252  4138  4198  4200
#[3,]  2227  2164  2196  2263
#[4,] 12883 12918 12741 12643
#[5,] 12014 11962 11731 11970
#[6,] 24839 24475 24443 24436

group <- c(rep("Control",4))
d <- DGEList(counts =  poisson, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
d$common.dispersion
#[1] 0.0001005378
summary(d$tagwise.dispersion)
#     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#1.571e-06 1.571e-06 1.571e-06 1.672e-05 9.963e-06 1.687e-03

The common dispersion was the same for the identical counts and for the Poisson-distributed counts; only the tagwise dispersions differed slightly.
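This makes sense: the negative binomial dispersion measures extra-Poisson variation, and a method-of-moments estimate of it from Poisson counts is essentially zero (a rough sketch, not edgeR's estimator):

```r
set.seed(31)
x <- rpois(100000, lambda = 100)
#negative binomial: var = mu + phi * mu^2, so phi = (var - mu) / mu^2
#for Poisson counts, var ~ mu, leaving phi ~ 0
phi_hat <- (var(x) - mean(x)) / mean(x)^2
phi_hat  #essentially zero
```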

Lastly, let's generate some counts using the negative binomial distribution:

#from the R documentation of rnbinom
#An alternative parametrization (often used in ecology) is by the mean mu,
#and size, the dispersion parameter, where prob = size/(size+mu).
#The variance is mu + mu^2/size in this parametrization.
set.seed(31)
nb <- t(mapply(rnbinom, n=rep(4, 20000), size=rep(10,20000), mu=mean))
head(nb)
#      [,1]  [,2]  [,3]  [,4]
#[1,]  5196  5024  4259  3149
#[2,]  3536  2888  4768  3010
#[3,]  2476   924  4058  1993
#[4,] 15688  8399  6963 12968
#[5,] 12321  8653 11610 11104
#[6,] 26923 27319 24340 22717

#calculate means
nb_mean <- apply(nb, 1, mean)

#the means of the generated counts correlate
#to the original means
cor(nb_mean, mean, method="spearman")
#[1] 0.9888302

group <- c(rep("Control",4))
d <- DGEList(counts =  nb, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
d$common.dispersion
#[1] 0.1001726
summary(d$tagwise.dispersion)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#0.07591 0.08641 0.09526 0.10010 0.10870 0.22200 

The common dispersion is higher when the read counts were generated from a negative binomial distribution. Finally, to understand the mu parameter of the rnbinom() function:

#mean mu
my_mu <- 500
my_size <- 1
my_n <- 1000
set.seed(31)
test <- rnbinom(n=my_n, size=my_size, mu=my_mu)
summary(test)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    0.0   149.8   338.0   498.2   715.0  3126.0 
var(test)
#[1] 245942.2
#theoretical variance: mu + mu^2/size
(my_mu + (my_mu^2/my_size))
#[1] 250500
#prob
my_size/(my_size+my_mu)
#[1] 0.001996008
plot(density(test))


The mean of the sampled read counts (498.2) is close to what we specified as mu (500), and the variance of the sampled read counts (245,942.2) is close to the theoretical variance mu + mu^2/size (250,500). The smaller the size parameter and the larger mu, the larger the variance.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MASS_7.3-31  edgeR_3.6.0  limma_3.20.1

loaded via a namespace (and not attached):
[1] tools_3.1.0