Updated: 2017 September 7th

When I was first learning about conducting a differential expression (DE) analysis with RNA-seq data, I found it very difficult to understand the statistical procedures implemented in various R packages that performed the DE analysis. This really bugged me. However, it was not difficult to carry out the analysis, since the authors of various packages had written easy-to-follow vignettes. Therefore to help myself understand the packages a little more, I would create various datasets with extreme properties to observe the effect/s on the analysis. For this post, I had followed this strategy to see how one particular parameter, the common dispersion in edgeR, would change with different datasets.

When we conduct a DE analysis, the null hypothesis is that, for a given gene, the mean expression between two groups is the same. Why can't we just perform a t-test on each gene then? Firstly, reads are count based and the Student's *t*-distribution is a continuous distribution.

http://bridgeslab.uthsc.edu/posts/why-do-we-use-the-negative-binomial-distribution-for-rnaseq

Start reading the edgeR user guide. The Poisson distribution does not account for all the variability observed in RNA-seq data. This post explains why RNA-seq is a Poisson process. In addition it is important to note that we are modelling counts across replicates (so that we can estimate the variance of a gene) when performing a differential expression analysis. As explained in this thread, under the "common dispersion model" we use the same value for the dispersion when modelling the variance for each gene.

The generated values represent the counts of reads, i.e., expression level, for a gene. Each row will represent one gene and I will generate counts for 20,000 genes. Each column represents a sample and I will have 4 samples. Therefore the common dispersion calculations will be based on a matrix that has 20,000 rows and 4 columns.

Firstly, let's generate the average expression of all genes using the negative binomial distribution:

#negative binomial distribution of mean expression #n = number of observations #size = target for number of successful trials #prob = probability of success in each trial set.seed(31) mean <- rnbinom(n=20000, size=1, prob=0.0001) #on average it takes ~10,000 times to have one success #this makes sense because I set the probability to 0.0001 summary(mean) # Min. 1st Qu. Median Mean 3rd Qu. Max. # 1 2803 6915 9898 13780 106300 plot(density(mean))

*This distribution is meant to represent the fact that few genes are very highly expressed*.

What's the common dispersion when the expression of every gene is identical in 4 samples?

#install if necessary source("http://bioconductor.org/biocLite.R") biocLite('edgeR') #load library library(edgeR) #identical expression identical <- matrix(rep(x=mean, each=4), nrow=length(mean), byrow=T) head(identical) # [,1] [,2] [,3] [,4] #[1,] 5386 5386 5386 5386 #[2,] 4282 4282 4282 4282 #[3,] 2243 2243 2243 2243 #[4,] 12830 12830 12830 12830 #[5,] 11962 11962 11962 11962 #[6,] 24473 24473 24473 24473 group <- c(rep("Control",4)) d <- DGEList(counts = identical, group=group) d <- calcNormFactors(d) d <- estimateCommonDisp(d) d <- estimateTagwiseDisp(d) d$common.dispersion #[1] 0.0001005378 summary(d$tagwise.dispersion) # Min. 1st Qu. Median Mean 3rd Qu. Max. #1.571e-06 1.571e-06 1.571e-06 1.571e-06 1.571e-06 1.571e-06

How about counts following a Poisson distribution:

set.seed(31) poisson <- t(sapply(mean, rpois, n=4)) head(poisson) # [,1] [,2] [,3] [,4] #[1,] 5390 5372 5377 5496 #[2,] 4252 4138 4198 4200 #[3,] 2227 2164 2196 2263 #[4,] 12883 12918 12741 12643 #[5,] 12014 11962 11731 11970 #[6,] 24839 24475 24443 24436 group <- c(rep("Control",4)) d <- DGEList(counts = poisson, group=group) d <- calcNormFactors(d) d <- estimateCommonDisp(d) d <- estimateTagwiseDisp(d) d$common.dispersion #[1] 0.0001005378 summary(d$tagwise.dispersion) # Min. 1st Qu. Median Mean 3rd Qu. Max. #1.571e-06 1.571e-06 1.571e-06 1.672e-05 9.963e-06 1.687e-03

The common dispersion between counts that were identical and counts that followed a Poisson distribution was identical.

Lastly, let's generate some counts using the negative binomial distribution:

#from the R documentation of rnbinom #An alternative parametrization (often used in ecology) is by the mean mu, #and size, the dispersion parameter, where prob = size/(size+mu). #The variance is mu + mu^2/size in this parametrization. set.seed(31) nb <- t(mapply(rnbinom, n=rep(4, 20000), size=rep(10,20000), mu=mean)) head(nb) # [,1] [,2] [,3] [,4] #[1,] 5196 5024 4259 3149 #[2,] 3536 2888 4768 3010 #[3,] 2476 924 4058 1993 #[4,] 15688 8399 6963 12968 #[5,] 12321 8653 11610 11104 #[6,] 26923 27319 24340 22717 #calculate means nb_mean <- apply(blah, 1, mean) #the means of the generated counts correlate #to the original means cor(nb_mean, mean, method="spearman") #[1] 0.9888302 group <- c(rep("Control",4)) d <- DGEList(counts = nb, group=group) d <- calcNormFactors(d) d <- estimateCommonDisp(d) d <- estimateTagwiseDisp(d) d$common.dispersion #[1] 0.1001726 summary(d$tagwise.dispersion) # Min. 1st Qu. Median Mean 3rd Qu. Max. #0.07591 0.08641 0.09526 0.10010 0.10870 0.22200

The common dispersion is higher when the read counts were generated from a negative binomial distribution. Lastly to understand the mu parameter of the rnbinom() function:

#mean mu my_mu <- 500 my_size <- 1 my_n <- 1000 set.seed(31) test <- rnbinom(n=my_n, size=my_size, mu=my_mu) summary(test) # Min. 1st Qu. Median Mean 3rd Qu. Max. # 0.0 149.8 338.0 498.2 715.0 3126.0 var(test) #[1] 245942.2 #dispersion (my_mu + (my_mu^2/my_size)) #[1] 250500 #prob my_size/(my_size+my_mu) #[1] 0.001996008 plot(density(test))

The mean of the sampled read counts (498.2) is close what we specified as mu (500) and the variance of the sampled read counts (245942.2) is close to the dispersion parameter (my_mu + (my_mu^2/my_size)). The smaller the size parameter and the larger mu is, the larger the dispersion.

sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C [5] LC_TIME=English_Australia.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] MASS_7.3-31 edgeR_3.6.0 limma_3.20.1 loaded via a namespace (and not attached): [1] tools_3.1.0

This work is licensed under a Creative Commons

Attribution 4.0 International License.