Firstly from Davis’s homepage download the file pnas_expression.txt. For more information on the dataset please refer to the edgeR manual.
The latest R version at the time of writing is R 2.13.1. You can download it from here.
Install bioconductor and the required packages:
source(“http://www.bioconductor.org/biocLite.R”)
biocLite()
biocLite(“DESeq”)
biocLite(“edgeR”)
A filtering criteria of keeping “only those tags that have at least one count per million in at least three samples” was specified in the edgeR manual. As I’m no good with R, I will preprocess the pnas_expression.txt file with Perl imposing the same filter and also removing the “len” column.
#!/usr/bin/perl
use strict;
use warnings;
my $infile = 'pnas_expression.txt';
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
my @library_size = ();
while(<IN>){
chomp;
next if /^ensembl_ID/;
#ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8 len
#ENSG00000215696 0 0 0 0 0 0 0 330
my @line_split = split(/\t/);
for (my $i = 1; $i<8; ++$i){
$library_size[$i] += $line_split[$i];
}
}
close(IN);
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
chomp;
my @line_split = split(/\t/);
pop(@line_split);
if (/^ensembl_ID/){
print join("\t",@line_split),"\n";
next;
}
my $checkpoint = '0';
for (my $i = 1; $i<8; ++$i){
my $tpm = $line_split[$i] * 1_000_000 / $library_size[$i];
++$checkpoint if $tpm >= 1;
}
print join ("\t",@line_split),"\n" if $checkpoint >= 3;
}
close(IN);
exit(0);
__END__
filter.pl > pnas_expression_filtered.tsv
library(edgeR)
raw.data <- read.delim("pnas_expression_filtered.tsv")
d <- raw.data[,2:8]
rownames(d) <- raw.data[,1]
group <- c(rep("Control",4),rep("DHT",3))
d <- DGEList(counts = d, group=group)
dim(d)
[1] 16494 7
d <- calcNormFactors(d)
d
An object of class "DGEList"
$samples
group lib.size norm.factors
lane1 Control 976847 1.0296636
lane2 Control 1154746 1.0372521
lane3 Control 1439393 1.0362662
lane4 Control 1482652 1.0378383
lane5 DHT 1820628 0.9537095
lane6 DHT 1831553 0.9525624
lane8 DHT 680798 0.9583181
$counts
lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000124208 478 619 628 744 483 716 240
ENSG00000182463 27 20 27 26 48 55 24
ENSG00000124201 180 218 293 275 373 301 88
ENSG00000124207 76 80 85 97 80 81 37
ENSG00000125835 132 200 200 228 280 204 52
16489 more rows ...
$all.zeros
ENSG00000124208 ENSG00000182463 ENSG00000124201 ENSG00000124207 ENSG00000125835
FALSE FALSE FALSE FALSE FALSE
16489 more elements ...
plotMDS.dge(d, main = "MDS Plot for Li Data", xlim = c(-1, 1),labels = c("Control1","Control2","Control3","Control4","DHT1", "DHT2", "DHT3"))
On dimension 2, DHT3 is quite different from the other DHT samples. It should be noted that the library size of DHT3 is three times smaller than DHT1 and DHT2.
Meanwhile in DESeq:
library("DESeq")
countsTable <- read.delim("pnas_expression_filtered.tsv", header=TRUE, stringsAsFactors=TRUE)
rownames(countsTable) <- countsTable$ensembl_ID
countsTable <- countsTable[,-1]
conds <- c("N","N","N","N","T","T","T")
cds <- newCountDataSet (countsTable, conds)
cds <- estimateSizeFactors(cds)
sizeFactors(cds)
lane1 lane2 lane3 lane4 lane5 lane6 lane8
0.7911762 0.9433534 1.1885656 1.2310312 1.4100653 1.4225280 0.5140726
For more information on how DESeq calculates the size factors, see this post. I attempted to calculate these values manually by following their instructions and I got 0.726441658 0.87339311 1.103958663 1.141646857 1.276470588 1.285666056 0.449085824. Slightly off, perhaps due to Excel or more likely due to something I did wrong.
On the other hand, edgeR has these normalisation factors:
group lib.size norm.factors
lane1 Control 976847 1.0296636
lane2 Control 1154746 1.0372521
lane3 Control 1439393 1.0362662
lane4 Control 1482652 1.0378383
lane5 DHT 1820628 0.9537095
lane6 DHT 1831553 0.9525624
lane8 DHT 680798 0.9583181
See this post for more information on the norm.factors.
Before calling differential expression on any of the genes, it is important to estimate the variance. Both methods use the negative binomial distribution to model the data as the level of noise is beyond what the Poisson model can account for due to biological variance.
Using edgeR to calculate the variance
d <- estimateCommonDisp(d) d$samples$lib.size [1] 976847 1154746 1439393 1482652 1820628 1831553 680798 d$common.lib.size [1] 1274589 colSums(d$pseudo.alt) lane1 lane2 lane3 lane4 lane5 lane6 lane8 1237946 1228854 1230106 1228297 1336795 1338335 1331880 d$common.dispersion [1] 0.01999041 sqrt(d$common.dispersion) [1] 0.1413875
The common dispersion is the “squared coefficient of variation”, where the coefficient of variation gives the amount of variability in the true abundance for each gene between replicates. The common dispersion estimate is ~0.02, which (take square root) gives an estimate of the coefficient of variation of 0.14. The true abundance for each gene can vary up or down by 14% between replicates.
Using DESeq:
cds <- estimateVarianceFunctions(cds) scvPlot(cs,ylim=c(0,2))
The x axis is the base mean, calculated using the size factors and the y axis is the squared coefficient of variation, the ratio of the variance at base level to the square of the base mean. The solid lines are the SCV for the raw variances. The dotted line is the base variance, calculated using the size factors. The solid black line is a density estimate of the base means; only where there is an appreciable number of base mean values, the variance estimates can be expected to be accurate. Observe the decrease in variance as the base mean of the genes increases.
diagForT <- varianceFitDiagnostics(cds,"T")
head(diagForT)
baseMean baseVar fittedRawVar fittedBaseVar pchisq
ENSG00000124208 437.57559 7106.70216 2133.97917 2623.68628 0.9333747
ENSG00000182463 39.79685 40.93749 50.62900 95.16714 0.3495977
ENSG00000124201 215.76798 2191.36827 668.79704 910.27101 0.9099491
ENSG00000124207 61.88337 76.38028 96.49555 165.75153 0.3692287
ENSG00000125835 147.71069 2386.52531 366.69439 532.00292 0.9887335
ENSG00000125834 73.61848 310.50106 125.27327 207.66245 0.7758007
smoothScatter (log10(diagForT$baseMean), log10(diagForT$baseVar))
lines (log10(fittedBaseVar) ~ log10(baseMean), diagForT[order(diagForT$baseMean),],col="red")
residualsEcdfPlot(cds,"T")
residualsEcdfPlot(cds,"N")
The fitted base variance is the predicted value from the local fit through the base variance estimates from all genes.
The base mean is proportional to the base variance until there is high biological variability with abundantly expressed genes.
The pchisq column from the varianceFitDiagnostics() function is used to determine whether single-gene estimate of the base variance deviates too much from the fitted value (which is calculated by taking the variance from all genes into account). The above plots show an empirical cumulative density function (ECDF) of the pchisq values stratified by base level. Basically if the curves are above the green line, where variance is overestimated, there is nothing to worry about.
Calling DE using edgeR and using a common dispersion
de.com <- exactTest(d)
Comparison of groups: DHT - Control
topTags(de.com)
Comparison of groups: DHT-Control
logConc logFC P.Value FDR
ENSG00000151503 -11.93733 5.821695 6.831306e-193 1.126756e-188
ENSG00000096060 -11.32227 5.009483 1.311104e-162 1.081267e-158
ENSG00000127954 -15.62219 8.235552 2.461375e-153 1.353264e-149
ENSG00000166451 -12.27691 4.686616 1.108080e-134 4.569166e-131
ENSG00000131016 -14.41810 5.306943 4.156214e-110 1.371052e-106
ENSG00000113594 -12.82287 4.116740 5.371074e-102 1.476508e-98
ENSG00000116285 -13.55664 4.204769 4.673617e-93 1.101238e-89
ENSG00000123983 -12.08499 3.660756 1.263258e-92 2.604521e-89
ENSG00000166086 -15.23487 5.507542 1.925054e-90 3.527982e-87
ENSG00000162772 -10.80630 3.318369 7.379002e-86 1.217093e-82
summary(decideTestsDGE(de.com, p.value=0.01))
[,1]
-1 1621
0 13110
1 1763
summary(decideTestsDGE(de.com, p.value=0.05))
[,1]
-1 2467
0 11555
1 2472
At an adjusted p-value of 0.05, 4,939 are differentially expressed.
d <- estimateTagwiseDisp(d,prior.n=10,trend=TRUE, prop.used=0.3, grid.length = 500)
summary(d$tagwise.dispersion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.009082 0.021450 0.040580 0.062720 0.097690 0.383100
plot(log2(d$conc$conc.common), d$tagwise.dispersion, panel.first = grid())
abline(h=d$common.dispersion, col="dodgerblue",lwd=3)
Dispersion decreases with tag concentration.
de.tgw <- exactTest (d,common.disp=FALSE)
Comparison of groups: DHT - Control
topTags(de.tgw)
Comparison of groups: DHT-Control
logConc logFC P.Value FDR
ENSG00000151503 -11.936602 5.820621 6.372940e-312 1.051153e-307
ENSG00000096060 -11.321444 5.007775 3.937429e-270 3.247198e-266
ENSG00000166451 -12.278413 4.686846 1.314101e-213 7.224927e-210
ENSG00000127954 -15.621048 8.234180 1.513030e-191 6.238980e-188
ENSG00000113594 -12.824797 4.115144 4.910925e-156 1.620016e-152
ENSG00000162772 -10.805702 3.317800 6.173781e-145 1.697172e-141
ENSG00000123983 -12.085611 3.657763 1.283904e-129 3.025245e-126
ENSG00000116133 -11.730415 3.245004 1.530499e-126 3.155506e-123
ENSG00000115648 -8.820711 2.598023 1.686581e-124 3.090942e-121
ENSG00000116285 -13.556232 4.206675 2.656142e-122 4.381040e-119
summary(decideTestsDGE(de.tgw,p.value=0.05))
[,1]
-1 2124
0 12054
1 2316
mv <- plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,dispersion.method="qcml",NBline=TRUE)
See smooth scatter plot above.
Calling differential expression using DESeq
res <- nbinomTest( cds,"N","T") resSig <- res[res$padj < 0.05,] nrow(resSig) [1] 4965
Taking the top 100 differentially expressed genes (ordered by p-value) from both lists there are only 88 overlaps. Taking the top 500, there are 450 overlaps.







Looks like in edgeR when calling significantly DE genes “summary(decideTestsDGE(de.com, p.value=0.05))” you are using the raw p.value, not the adjusted p.value. You’d need to add: adjust.method=”BH” or similar to get the adjusted val.
woops, my bad. Looks like that is the default anyway. Ignore previous post (and this one too I guess!)
Hi Seth,
Thanks for the comments. It’s good to point that it is the default just in case others were wondering too. So no need to ignore
Cheers,
Dave
Hello, the information is very helpful. I’m very curious why the overlap is so small of the top ranked p.values between DESeq and edgeR results. May I have your ideas?
Besides, what is d$pseudo.alt? Is it the quantile normalized data? How can we tell, at each step when we input DEGlist d, which count data does the algorithm use? For example, does it use original raw read count, or pseudo.alt?
Hi Yuan,
Thanks for the comment. I just redid the analysis using R version 2.14.1, edgeR_2.4.1, DESeq_1.6.1 and Bioconductor version 2.9.
The overlap between the top 100 genes was 88/100 and for the top 500 genes it was 450/500. I probably did not sort the DESeq output by the adjusted p-value in my original analysis (now that I realise the output is not sorted by default unlike edgeR). I will modify the post.
From the edgeR manual,
The element common.dispersion, as the name suggests, provides the estimate of the common dispersion, and pseudo.alt gives the pseudocounts calculated under the alternative hypothesis.
The element common.lib.size gives the library size to which the original libraries have been adjusted in the pseudocounts.
If I do not run:
d < - calcNormFactors(d)
The output becomes:
> d$common.lib.size
[1] 1274589
> colSums(d$pseudo.alt)
lane1 lane2 lane3 lane4 lane5 lane6 lane8
1274905 1274630 1274702 1274787 1274935 1274798 1275881
The manual is much more informative than my blog will ever be. If you still have questions, you should email bioconductor at r-project.org, and most likely get a reply from the authors of the packages.
Cheers,
Dave
Thanks! –Yuan