Firstly from Davis's homepage download the file pnas_expression.txt. For more information on the dataset please refer to the edgeR manual and this paper.

The latest R version at the time of writing is R 2.13.1. You can download it from here.

Install bioconductor and the required packages:

source(“http://www.bioconductor.org/biocLite.R”)

biocLite()

biocLite(“DESeq”)

biocLite(“edgeR”)

A filtering criteria of keeping "only those tags that have at least one count per million in at least three samples" was specified in the edgeR manual. As I'm no good with R, I will preprocess the pnas_expression.txt file with Perl imposing the same filter and also removing the "len" column.

#!/usr/bin/perl use strict; use warnings; my $infile = 'pnas_expression.txt'; open(IN,'<',$infile) || die "Could not open $infile: $!\n"; my @library_size = (); while(<IN>){ chomp; next if /^ensembl_ID/; #ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8 len #ENSG00000215696 0 0 0 0 0 0 0 330 my @line_split = split(/\t/); for (my $i = 1; $i<8; ++$i){ $library_size[$i] += $line_split[$i]; } } close(IN); open(IN,'<',$infile) || die "Could not open $infile: $!\n"; while(<IN>){ chomp; my @line_split = split(/\t/); pop(@line_split); if (/^ensembl_ID/){ print join("\t",@line_split),"\n"; next; } my $checkpoint = '0'; for (my $i = 1; $i<8; ++$i){ my $tpm = $line_split[$i] * 1_000_000 / $library_size[$i]; ++$checkpoint if $tpm >= 1; } print join ("\t",@line_split),"\n" if $checkpoint >= 3; } close(IN); exit(0); __END__

filter.pl > pnas_expression_filtered.tsv

library(edgeR) raw.data <- read.delim("pnas_expression_filtered.tsv") d <- raw.data[,2:8] rownames(d) <- raw.data[,1] group <- c(rep("Control",4),rep("DHT",3)) d <- DGEList(counts = d, group=group) dim(d) [1] 16494 7 d <- calcNormFactors(d) d An object of class "DGEList" $samples group lib.size norm.factors lane1 Control 976847 1.0296636 lane2 Control 1154746 1.0372521 lane3 Control 1439393 1.0362662 lane4 Control 1482652 1.0378383 lane5 DHT 1820628 0.9537095 lane6 DHT 1831553 0.9525624 lane8 DHT 680798 0.9583181 $counts lane1 lane2 lane3 lane4 lane5 lane6 lane8 ENSG00000124208 478 619 628 744 483 716 240 ENSG00000182463 27 20 27 26 48 55 24 ENSG00000124201 180 218 293 275 373 301 88 ENSG00000124207 76 80 85 97 80 81 37 ENSG00000125835 132 200 200 228 280 204 52 16489 more rows ... $all.zeros ENSG00000124208 ENSG00000182463 ENSG00000124201 ENSG00000124207 ENSG00000125835 FALSE FALSE FALSE FALSE FALSE 16489 more elements ... plotMDS.dge(d, main = "MDS Plot for Li Data", xlim = c(-1, 1),labels = c("Control1","Control2","Control3","Control4","DHT1", "DHT2", "DHT3"))

On dimension 2, DHT3 is quite different from the other DHT samples. It should be noted that the library size of DHT3 is three times smaller than DHT1 and DHT2.

Meanwhile in DESeq:

library("DESeq") countsTable <- read.delim("pnas_expression_filtered.tsv", header=TRUE, stringsAsFactors=TRUE) rownames(countsTable) <- countsTable$ensembl_ID countsTable <- countsTable[,-1] conds <- c("N","N","N","N","T","T","T") cds <- newCountDataSet (countsTable, conds) cds <- estimateSizeFactors(cds) sizeFactors(cds) lane1 lane2 lane3 lane4 lane5 lane6 lane8 0.7911762 0.9433534 1.1885656 1.2310312 1.4100653 1.4225280 0.5140726

For more information on how DESeq calculates the size factors, see this post. I attempted to calculate these values manually by following their instructions and I got 0.726441658 0.87339311 1.103958663 1.141646857 1.276470588 1.285666056 0.449085824. Slightly off, perhaps due to Excel or more likely due to something I did wrong.

On the other hand, edgeR has these normalisation factors:

group lib.size norm.factors

lane1 Control 976847 1.0296636

lane2 Control 1154746 1.0372521

lane3 Control 1439393 1.0362662

lane4 Control 1482652 1.0378383

lane5 DHT 1820628 0.9537095

lane6 DHT 1831553 0.9525624

lane8 DHT 680798 0.9583181

See this post for more information on the norm.factors.

Before calling differential expression on any of the genes, it is important to estimate the variance. Both methods use the negative binomial distribution to model the data as the level of noise is beyond what the Poisson model can account for due to biological variance.

Using edgeR to calculate the variance

d <- estimateCommonDisp(d) d$samples$lib.size [1] 976847 1154746 1439393 1482652 1820628 1831553 680798 d$common.lib.size [1] 1274589 colSums(d$pseudo.alt) lane1 lane2 lane3 lane4 lane5 lane6 lane8 1237946 1228854 1230106 1228297 1336795 1338335 1331880 d$common.dispersion [1] 0.01999041 sqrt(d$common.dispersion) [1] 0.1413875

The common dispersion is the "squared coefficient of variation", where the coefficient of variation gives the amount of variability in the true abundance for each gene between replicates. The common dispersion estimate is ~0.02, which (take square root) gives an estimate of the coefficient of variation of 0.14. The true abundance for each gene can vary up or down by 14% between replicates.

Using DESeq:

cds <- estimateVarianceFunctions(cds) scvPlot(cs,ylim=c(0,2))

The x axis is the base mean, calculated using the size factors and the y axis is the squared coefficient of variation, the ratio of the variance at base level to the square of the base mean. The solid lines are the SCV for the raw variances. The dotted line is the base variance, calculated using the size factors. The solid black line is a density estimate of the base means; only where there is an appreciable number of base mean values, the variance estimates can be expected to be accurate. Observe the decrease in variance as the base mean of the genes increases.

diagForT <- varianceFitDiagnostics(cds,"T") head(diagForT) baseMean baseVar fittedRawVar fittedBaseVar pchisq ENSG00000124208 437.57559 7106.70216 2133.97917 2623.68628 0.9333747 ENSG00000182463 39.79685 40.93749 50.62900 95.16714 0.3495977 ENSG00000124201 215.76798 2191.36827 668.79704 910.27101 0.9099491 ENSG00000124207 61.88337 76.38028 96.49555 165.75153 0.3692287 ENSG00000125835 147.71069 2386.52531 366.69439 532.00292 0.9887335 ENSG00000125834 73.61848 310.50106 125.27327 207.66245 0.7758007 smoothScatter (log10(diagForT$baseMean), log10(diagForT$baseVar)) lines (log10(fittedBaseVar) ~ log10(baseMean), diagForT[order(diagForT$baseMean),],col="red") residualsEcdfPlot(cds,"T") residualsEcdfPlot(cds,"N")

The fitted base variance is the predicted value from the local fit through the base variance estimates from all genes.

The base mean is proportional to the base variance until there is high biological variability with abundantly expressed genes.

The pchisq column from the varianceFitDiagnostics() function is used to determine whether single-gene estimate of the base variance deviates too much from the fitted value (which is calculated by taking the variance from all genes into account). The above plots show an empirical cumulative density function (ECDF) of the pchisq values stratified by base level. Basically if the curves are above the green line, where variance is overestimated, there is nothing to worry about.

Calling DE using edgeR and using a common dispersion

de.com <- exactTest(d) Comparison of groups: DHT - Control topTags(de.com) Comparison of groups: DHT-Control logConc logFC P.Value FDR ENSG00000151503 -11.93733 5.821695 6.831306e-193 1.126756e-188 ENSG00000096060 -11.32227 5.009483 1.311104e-162 1.081267e-158 ENSG00000127954 -15.62219 8.235552 2.461375e-153 1.353264e-149 ENSG00000166451 -12.27691 4.686616 1.108080e-134 4.569166e-131 ENSG00000131016 -14.41810 5.306943 4.156214e-110 1.371052e-106 ENSG00000113594 -12.82287 4.116740 5.371074e-102 1.476508e-98 ENSG00000116285 -13.55664 4.204769 4.673617e-93 1.101238e-89 ENSG00000123983 -12.08499 3.660756 1.263258e-92 2.604521e-89 ENSG00000166086 -15.23487 5.507542 1.925054e-90 3.527982e-87 ENSG00000162772 -10.80630 3.318369 7.379002e-86 1.217093e-82 summary(decideTestsDGE(de.com, p.value=0.01)) [,1] -1 1621 0 13110 1 1763 summary(decideTestsDGE(de.com, p.value=0.05)) [,1] -1 2467 0 11555 1 2472

At an adjusted p-value of 0.05, 4,939 are differentially expressed.

d <- estimateTagwiseDisp(d,prior.n=10,trend=TRUE, prop.used=0.3, grid.length = 500) summary(d$tagwise.dispersion) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.009082 0.021450 0.040580 0.062720 0.097690 0.383100 plot(log2(d$conc$conc.common), d$tagwise.dispersion, panel.first = grid()) abline(h=d$common.dispersion, col="dodgerblue",lwd=3)

Dispersion decreases with tag concentration.

de.tgw <- exactTest (d,common.disp=FALSE) Comparison of groups: DHT - Control topTags(de.tgw) Comparison of groups: DHT-Control logConc logFC P.Value FDR ENSG00000151503 -11.936602 5.820621 6.372940e-312 1.051153e-307 ENSG00000096060 -11.321444 5.007775 3.937429e-270 3.247198e-266 ENSG00000166451 -12.278413 4.686846 1.314101e-213 7.224927e-210 ENSG00000127954 -15.621048 8.234180 1.513030e-191 6.238980e-188 ENSG00000113594 -12.824797 4.115144 4.910925e-156 1.620016e-152 ENSG00000162772 -10.805702 3.317800 6.173781e-145 1.697172e-141 ENSG00000123983 -12.085611 3.657763 1.283904e-129 3.025245e-126 ENSG00000116133 -11.730415 3.245004 1.530499e-126 3.155506e-123 ENSG00000115648 -8.820711 2.598023 1.686581e-124 3.090942e-121 ENSG00000116285 -13.556232 4.206675 2.656142e-122 4.381040e-119 summary(decideTestsDGE(de.tgw,p.value=0.05)) [,1] -1 2124 0 12054 1 2316 mv <- plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,dispersion.method="qcml",NBline=TRUE)

See smooth scatter plot above.

Calling differential expression using DESeq

res <- nbinomTest( cds,"N","T") resSig <- res[res$padj < 0.05,] nrow(resSig) [1] 4965

Taking the top 100 differentially expressed genes (ordered by p-value) from both lists there are only 88 overlaps. Taking the top 500, there are 450 overlaps.