Firstly from Davis's homepage download the file pnas_expression.txt. For more information on the dataset please refer to the edgeR manual and this paper.
The latest R version at the time of writing is R 2.13.1. You can download it from here.
Install bioconductor and the required packages:
source(“http://www.bioconductor.org/biocLite.R”)
biocLite()
biocLite(“DESeq”)
biocLite(“edgeR”)
A filtering criteria of keeping "only those tags that have at least one count per million in at least three samples" was specified in the edgeR manual. As I'm no good with R, I will preprocess the pnas_expression.txt file with Perl imposing the same filter and also removing the "len" column.
#!/usr/bin/perl use strict; use warnings; my $infile = 'pnas_expression.txt'; open(IN,'<',$infile) || die "Could not open $infile: $!\n"; my @library_size = (); while(<IN>){ chomp; next if /^ensembl_ID/; #ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8 len #ENSG00000215696 0 0 0 0 0 0 0 330 my @line_split = split(/\t/); for (my $i = 1; $i<8; ++$i){ $library_size[$i] += $line_split[$i]; } } close(IN); open(IN,'<',$infile) || die "Could not open $infile: $!\n"; while(<IN>){ chomp; my @line_split = split(/\t/); pop(@line_split); if (/^ensembl_ID/){ print join("\t",@line_split),"\n"; next; } my $checkpoint = '0'; for (my $i = 1; $i<8; ++$i){ my $tpm = $line_split[$i] * 1_000_000 / $library_size[$i]; ++$checkpoint if $tpm >= 1; } print join ("\t",@line_split),"\n" if $checkpoint >= 3; } close(IN); exit(0); __END__
filter.pl > pnas_expression_filtered.tsv
library(edgeR) raw.data <- read.delim("pnas_expression_filtered.tsv") d <- raw.data[,2:8] rownames(d) <- raw.data[,1] group <- c(rep("Control",4),rep("DHT",3)) d <- DGEList(counts = d, group=group) dim(d) [1] 16494 7 d <- calcNormFactors(d) d An object of class "DGEList" $samples group lib.size norm.factors lane1 Control 976847 1.0296636 lane2 Control 1154746 1.0372521 lane3 Control 1439393 1.0362662 lane4 Control 1482652 1.0378383 lane5 DHT 1820628 0.9537095 lane6 DHT 1831553 0.9525624 lane8 DHT 680798 0.9583181 $counts lane1 lane2 lane3 lane4 lane5 lane6 lane8 ENSG00000124208 478 619 628 744 483 716 240 ENSG00000182463 27 20 27 26 48 55 24 ENSG00000124201 180 218 293 275 373 301 88 ENSG00000124207 76 80 85 97 80 81 37 ENSG00000125835 132 200 200 228 280 204 52 16489 more rows ... $all.zeros ENSG00000124208 ENSG00000182463 ENSG00000124201 ENSG00000124207 ENSG00000125835 FALSE FALSE FALSE FALSE FALSE 16489 more elements ... plotMDS.dge(d, main = "MDS Plot for Li Data", xlim = c(-1, 1),labels = c("Control1","Control2","Control3","Control4","DHT1", "DHT2", "DHT3"))
On dimension 2, DHT3 is quite different from the other DHT samples. It should be noted that the library size of DHT3 is three times smaller than DHT1 and DHT2.
Meanwhile in DESeq:
library("DESeq") countsTable <- read.delim("pnas_expression_filtered.tsv", header=TRUE, stringsAsFactors=TRUE) rownames(countsTable) <- countsTable$ensembl_ID countsTable <- countsTable[,-1] conds <- c("N","N","N","N","T","T","T") cds <- newCountDataSet (countsTable, conds) cds <- estimateSizeFactors(cds) sizeFactors(cds) lane1 lane2 lane3 lane4 lane5 lane6 lane8 0.7911762 0.9433534 1.1885656 1.2310312 1.4100653 1.4225280 0.5140726
For more information on how DESeq calculates the size factors, see this post. I attempted to calculate these values manually by following their instructions and I got 0.726441658 0.87339311 1.103958663 1.141646857 1.276470588 1.285666056 0.449085824. Slightly off, perhaps due to Excel or more likely due to something I did wrong.
On the other hand, edgeR has these normalisation factors:
group lib.size norm.factors
lane1 Control 976847 1.0296636
lane2 Control 1154746 1.0372521
lane3 Control 1439393 1.0362662
lane4 Control 1482652 1.0378383
lane5 DHT 1820628 0.9537095
lane6 DHT 1831553 0.9525624
lane8 DHT 680798 0.9583181
See this post for more information on the norm.factors.
Before calling differential expression on any of the genes, it is important to estimate the variance. Both methods use the negative binomial distribution to model the data as the level of noise is beyond what the Poisson model can account for due to biological variance.
Using edgeR to calculate the variance
d <- estimateCommonDisp(d) d$samples$lib.size [1] 976847 1154746 1439393 1482652 1820628 1831553 680798 d$common.lib.size [1] 1274589 colSums(d$pseudo.alt) lane1 lane2 lane3 lane4 lane5 lane6 lane8 1237946 1228854 1230106 1228297 1336795 1338335 1331880 d$common.dispersion [1] 0.01999041 sqrt(d$common.dispersion) [1] 0.1413875
The common dispersion is the "squared coefficient of variation", where the coefficient of variation gives the amount of variability in the true abundance for each gene between replicates. The common dispersion estimate is ~0.02, which (take square root) gives an estimate of the coefficient of variation of 0.14. The true abundance for each gene can vary up or down by 14% between replicates.
Using DESeq:
cds <- estimateVarianceFunctions(cds) scvPlot(cs,ylim=c(0,2))
The x axis is the base mean, calculated using the size factors and the y axis is the squared coefficient of variation, the ratio of the variance at base level to the square of the base mean. The solid lines are the SCV for the raw variances. The dotted line is the base variance, calculated using the size factors. The solid black line is a density estimate of the base means; only where there is an appreciable number of base mean values, the variance estimates can be expected to be accurate. Observe the decrease in variance as the base mean of the genes increases.
diagForT <- varianceFitDiagnostics(cds,"T") head(diagForT) baseMean baseVar fittedRawVar fittedBaseVar pchisq ENSG00000124208 437.57559 7106.70216 2133.97917 2623.68628 0.9333747 ENSG00000182463 39.79685 40.93749 50.62900 95.16714 0.3495977 ENSG00000124201 215.76798 2191.36827 668.79704 910.27101 0.9099491 ENSG00000124207 61.88337 76.38028 96.49555 165.75153 0.3692287 ENSG00000125835 147.71069 2386.52531 366.69439 532.00292 0.9887335 ENSG00000125834 73.61848 310.50106 125.27327 207.66245 0.7758007 smoothScatter (log10(diagForT$baseMean), log10(diagForT$baseVar)) lines (log10(fittedBaseVar) ~ log10(baseMean), diagForT[order(diagForT$baseMean),],col="red") residualsEcdfPlot(cds,"T") residualsEcdfPlot(cds,"N")
The fitted base variance is the predicted value from the local fit through the base variance estimates from all genes.
The base mean is proportional to the base variance until there is high biological variability with abundantly expressed genes.
The pchisq column from the varianceFitDiagnostics() function is used to determine whether single-gene estimate of the base variance deviates too much from the fitted value (which is calculated by taking the variance from all genes into account). The above plots show an empirical cumulative density function (ECDF) of the pchisq values stratified by base level. Basically if the curves are above the green line, where variance is overestimated, there is nothing to worry about.
Calling DE using edgeR and using a common dispersion
de.com <- exactTest(d) Comparison of groups: DHT - Control topTags(de.com) Comparison of groups: DHT-Control logConc logFC P.Value FDR ENSG00000151503 -11.93733 5.821695 6.831306e-193 1.126756e-188 ENSG00000096060 -11.32227 5.009483 1.311104e-162 1.081267e-158 ENSG00000127954 -15.62219 8.235552 2.461375e-153 1.353264e-149 ENSG00000166451 -12.27691 4.686616 1.108080e-134 4.569166e-131 ENSG00000131016 -14.41810 5.306943 4.156214e-110 1.371052e-106 ENSG00000113594 -12.82287 4.116740 5.371074e-102 1.476508e-98 ENSG00000116285 -13.55664 4.204769 4.673617e-93 1.101238e-89 ENSG00000123983 -12.08499 3.660756 1.263258e-92 2.604521e-89 ENSG00000166086 -15.23487 5.507542 1.925054e-90 3.527982e-87 ENSG00000162772 -10.80630 3.318369 7.379002e-86 1.217093e-82 summary(decideTestsDGE(de.com, p.value=0.01)) [,1] -1 1621 0 13110 1 1763 summary(decideTestsDGE(de.com, p.value=0.05)) [,1] -1 2467 0 11555 1 2472
At an adjusted p-value of 0.05, 4,939 are differentially expressed.
d <- estimateTagwiseDisp(d,prior.n=10,trend=TRUE, prop.used=0.3, grid.length = 500) summary(d$tagwise.dispersion) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.009082 0.021450 0.040580 0.062720 0.097690 0.383100 plot(log2(d$conc$conc.common), d$tagwise.dispersion, panel.first = grid()) abline(h=d$common.dispersion, col="dodgerblue",lwd=3)
Dispersion decreases with tag concentration.
de.tgw <- exactTest (d,common.disp=FALSE) Comparison of groups: DHT - Control topTags(de.tgw) Comparison of groups: DHT-Control logConc logFC P.Value FDR ENSG00000151503 -11.936602 5.820621 6.372940e-312 1.051153e-307 ENSG00000096060 -11.321444 5.007775 3.937429e-270 3.247198e-266 ENSG00000166451 -12.278413 4.686846 1.314101e-213 7.224927e-210 ENSG00000127954 -15.621048 8.234180 1.513030e-191 6.238980e-188 ENSG00000113594 -12.824797 4.115144 4.910925e-156 1.620016e-152 ENSG00000162772 -10.805702 3.317800 6.173781e-145 1.697172e-141 ENSG00000123983 -12.085611 3.657763 1.283904e-129 3.025245e-126 ENSG00000116133 -11.730415 3.245004 1.530499e-126 3.155506e-123 ENSG00000115648 -8.820711 2.598023 1.686581e-124 3.090942e-121 ENSG00000116285 -13.556232 4.206675 2.656142e-122 4.381040e-119 summary(decideTestsDGE(de.tgw,p.value=0.05)) [,1] -1 2124 0 12054 1 2316 mv <- plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,dispersion.method="qcml",NBline=TRUE)
See smooth scatter plot above.
Calling differential expression using DESeq
res <- nbinomTest( cds,"N","T") resSig <- res[res$padj < 0.05,] nrow(resSig) [1] 4965
Taking the top 100 differentially expressed genes (ordered by p-value) from both lists there are only 88 overlaps. Taking the top 500, there are 450 overlaps.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Looks like in edgeR when calling significantly DE genes “summary(decideTestsDGE(de.com, p.value=0.05))” you are using the raw p.value, not the adjusted p.value. You’d need to add: adjust.method=”BH” or similar to get the adjusted val.
woops, my bad. Looks like that is the default anyway. Ignore previous post (and this one too I guess!)
Hi Seth,
Thanks for the comments. It’s good to point that it is the default just in case others were wondering too. So no need to ignore 🙂
Cheers,
Dave
Hello, the information is very helpful. I’m very curious why the overlap is so small of the top ranked p.values between DESeq and edgeR results. May I have your ideas?
Besides, what is d$pseudo.alt? Is it the quantile normalized data? How can we tell, at each step when we input DEGlist d, which count data does the algorithm use? For example, does it use original raw read count, or pseudo.alt?
Hi Yuan,
Thanks for the comment. I just redid the analysis using R version 2.14.1, edgeR_2.4.1, DESeq_1.6.1 and Bioconductor version 2.9.
The overlap between the top 100 genes was 88/100 and for the top 500 genes it was 450/500. I probably did not sort the DESeq output by the adjusted p-value in my original analysis (now that I realise the output is not sorted by default unlike edgeR). I will modify the post.
From the edgeR manual,
The element common.dispersion, as the name suggests, provides the estimate of the common dispersion, and pseudo.alt gives the pseudocounts calculated under the alternative hypothesis.
The element common.lib.size gives the library size to which the original libraries have been adjusted in the pseudocounts.
If I do not run:
d <- calcNormFactors(d) The output becomes: > d$common.lib.size
[1] 1274589
> colSums(d$pseudo.alt)
lane1 lane2 lane3 lane4 lane5 lane6 lane8
1274905 1274630 1274702 1274787 1274935 1274798 1275881
The manual is much more informative than my blog will ever be. If you still have questions, you should email bioconductor at r-project.org, and most likely get a reply from the authors of the packages.
Cheers,
Dave
Thanks! –Yuan
Do you happen to know how to get the genes that contribute most to the two axes out from the MDS?
Cheers,
Seth
Hi Seth,
For the MDS plot I don’t know. However you could perform a PCA using R, then have a look at the scores of each feature at the different components, for example:
data <- read.table("some.file") data.pca <- prcomp(data) data.pca$x
I don't know the difference between a MDS analysis versus a PCA, but they seem so be similar.
Hope that helps,
Dave
Thanks for the suggestion