# edgeR’s common dispersion

Updated: 2017 September 7th

When I was first learning about conducting a differential expression (DE) analysis with RNA-seq data, I found it very difficult to understand the statistical procedures implemented in various R packages that performed the DE analysis. This really bugged me. However, it was not difficult to carry out the analysis, since the authors of various packages had written easy-to-follow vignettes. Therefore to help myself understand the packages a little more, I would create various datasets with extreme properties to observe the effect/s on the analysis. For this post, I had followed this strategy to see how one particular parameter, the common dispersion in edgeR, would change with different datasets.

When we conduct a DE analysis, the null hypothesis is that, for a given gene, the mean expression between two groups is the same. Why can't we just perform a t-test on each gene then? Firstly, reads are count based and the Student's t-distribution is a continuous distribution.

http://bridgeslab.uthsc.edu/posts/why-do-we-use-the-negative-binomial-distribution-for-rnaseq

Start reading the edgeR user guide. The Poisson distribution does not account for all the variability observed in RNA-seq data. This post explains why RNA-seq is a Poisson process. In addition it is important to note that we are modelling counts across replicates (so that we can estimate the variance of a gene) when performing a differential expression analysis. As explained in this thread, under the "common dispersion model" we use the same value for the dispersion when modelling the variance for each gene.

The generated values represent the counts of reads, i.e., expression level, for a gene. Each row will represent one gene and I will generate counts for 20,000 genes. Each column represents a sample and I will have 4 samples. Therefore the common dispersion calculations will be based on a matrix that has 20,000 rows and 4 columns.

Firstly, let's generate the average expression of all genes using the negative binomial distribution:

#negative binomial distribution of mean expression
#n    = number of observations
#size = target for number of successful trials
#prob = probability of success in each trial
set.seed(31)
mean <- rnbinom(n=20000, size=1, prob=0.0001)

#on average it takes ~10,000 times to have one success
#this makes sense because I set the probability to 0.0001
summary(mean)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#      1    2803    6915    9898   13780  106300

plot(density(mean))


This distribution is meant to represent the fact that few genes are very highly expressed.

What's the common dispersion when the expression of every gene is identical in 4 samples?

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite('edgeR')

library(edgeR)

#identical expression
identical <- matrix(rep(x=mean, each=4),
nrow=length(mean),
byrow=T)

#      [,1]  [,2]  [,3]  [,4]
#[1,]  5386  5386  5386  5386
#[2,]  4282  4282  4282  4282
#[3,]  2243  2243  2243  2243
#[4,] 12830 12830 12830 12830
#[5,] 11962 11962 11962 11962
#[6,] 24473 24473 24473 24473

group <- c(rep("Control",4))
d <- DGEList(counts =  identical, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)

d$common.dispersion #[1] 0.0001005378 summary(d$tagwise.dispersion)
#     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
#1.571e-06 1.571e-06 1.571e-06 1.571e-06 1.571e-06 1.571e-06


How about counts following a Poisson distribution:

set.seed(31)
poisson <- t(sapply(mean, rpois, n=4))
#      [,1]  [,2]  [,3]  [,4]
#[1,]  5390  5372  5377  5496
#[2,]  4252  4138  4198  4200
#[3,]  2227  2164  2196  2263
#[4,] 12883 12918 12741 12643
#[5,] 12014 11962 11731 11970
#[6,] 24839 24475 24443 24436

group <- c(rep("Control",4))
d <- DGEList(counts =  poisson, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
d$common.dispersion #[1] 0.0001005378 summary(d$tagwise.dispersion)
#     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
#1.571e-06 1.571e-06 1.571e-06 1.672e-05 9.963e-06 1.687e-03


The common dispersion between counts that were identical and counts that followed a Poisson distribution was identical.

Lastly, let's generate some counts using the negative binomial distribution:

#from the R documentation of rnbinom
#An alternative parametrization (often used in ecology) is by the mean mu,
#and size, the dispersion parameter, where prob = size/(size+mu).
#The variance is mu + mu^2/size in this parametrization.
set.seed(31)
nb <- t(mapply(rnbinom, n=rep(4, 20000), size=rep(10,20000), mu=mean))
#      [,1]  [,2]  [,3]  [,4]
#[1,]  5196  5024  4259  3149
#[2,]  3536  2888  4768  3010
#[3,]  2476   924  4058  1993
#[4,] 15688  8399  6963 12968
#[5,] 12321  8653 11610 11104
#[6,] 26923 27319 24340 22717

#calculate means
nb_mean <- apply(blah, 1, mean)

#the means of the generated counts correlate
#to the original means
cor(nb_mean, mean, method="spearman")
#[1] 0.9888302

group <- c(rep("Control",4))
d <- DGEList(counts =  nb, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
d$common.dispersion #[1] 0.1001726 summary(d$tagwise.dispersion)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#0.07591 0.08641 0.09526 0.10010 0.10870 0.22200


The common dispersion is higher when the read counts were generated from a negative binomial distribution. Lastly to understand the mu parameter of the rnbinom() function:

#mean mu
my_mu <- 500
my_size <- 1
my_n <- 1000
set.seed(31)
test <- rnbinom(n=my_n, size=my_size, mu=my_mu)
summary(test)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    0.0   149.8   338.0   498.2   715.0  3126.0
var(test)
#[1] 245942.2
#dispersion
(my_mu + (my_mu^2/my_size))
#[1] 250500
#prob
my_size/(my_size+my_mu)
#[1] 0.001996008
plot(density(test))


The mean of the sampled read counts (498.2) is close what we specified as mu (500) and the variance of the sampled read counts (245942.2) is close to the dispersion parameter (my_mu + (my_mu^2/my_size)). The smaller the size parameter and the larger mu is, the larger the dispersion.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] MASS_7.3-31  edgeR_3.6.0  limma_3.20.1

loaded via a namespace (and not attached):
[1] tools_3.1.0