Updated 2014 April 18th

For this post I will use data from this study, that has been nicely summarised already to examine the variance in RNA-Seq data. Briefly, the study used LNCaP cells, which are androgen-sensitive human prostate adenocarcinoma cells, and treated the cells with DHT and with a mock treatment as the control. The poly-A RNAs were captured and sequenced on 7 lanes. Let's get started:

#I host this file on my server for convenience
file_url <- 'http://davetang.org/eg/pnas_expression.txt'
data <- read.table(file_url, header=T, sep="\t", stringsAsFactors=F, row.names=1)
#check out the first 6 lines of data
head(data)
# lane1 lane2 lane3 lane4 lane5 lane6 lane8 len
#ENSG00000215696 0 0 0 0 0 0 0 330
#ENSG00000215700 0 0 0 0 0 0 0 2370
#ENSG00000215699 0 0 0 0 0 0 0 1842
#ENSG00000215784 0 0 0 0 0 0 0 2393
#ENSG00000212914 0 0 0 0 0 0 0 384
#ENSG00000212042 0 0 0 0 0 0 0 92
#what are the dimensions of the dataset?
dim(data)
#[1] 37435 8
#as we can see above
#many of the genes have zero expression
#let's filter them out
data_subset <- data[rowSums(data[,-8])>0,]
dim(data_subset)
#[1] 21877 8

At this point, we some expression from at least one lane for 21,877 genes. The first four lanes, i.e., columns 1-4, are the mock treated controls and lanes 5, 6, and 8, i.e., columns 5-7, are the DHT treated cells. The length of the genes is stored as the 8th column.

We can perform a simple counts per million normalisation:

#I'm not going to use the len column
data_subset_cpm <- data_subset[,-8]
#check out the column sums
colSums(data_subset_cpm)
# lane1 lane2 lane3 lane4 lane5 lane6 lane8
# 978576 1156844 1442169 1485604 1823460 1834335 681743
#function for normalisation
norm_cpm <- function(x){
x * 1000000 / sum(x)
}
#apply the normalisation
data_subset_cpm <- apply(data_subset_cpm, 2, norm_cpm)
head(data_subset_cpm)
# lane1 lane2 lane3 lane4 lane5 lane6 lane8
#ENSG00000124208 488.46487 535.07647 435.4552 500.806406 264.88105 390.33219 352.03882
#ENSG00000182463 27.59111 17.28842 18.7218 17.501299 26.32358 29.98362 35.20388
#ENSG00000124201 183.94075 188.44373 203.1662 185.109895 204.55617 164.09216 129.08090
#ENSG00000124205 0.00000 0.00000 3.4670 3.365634 0.00000 0.00000 0.00000
#ENSG00000124207 77.66387 69.15366 58.9390 65.293308 43.87264 44.15769 54.27265
#ENSG00000125835 134.88988 172.88416 138.6800 153.472931 153.55423 111.21197 76.27508
#sanity check
colSums(data_subset_cpm)
#lane1 lane2 lane3 lane4 lane5 lane6 lane8
#1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06

After normalising simply by counts per million, we can calculate the mean and variance of the gene expression per group:

mock_var <- apply(data_subset_cpm[,1:4], 1, var)
mock_mean <- apply(data_subset_cpm[,1:4], 1, mean)
dht_var <- apply(data_subset_cpm[,5:7], 1, var)
dht_mean <- apply(data_subset_cpm[,5:7], 1, mean)
par(mfrow=c(1,2))
plot(density(mock_mean))
plot(density(dht_mean))

*Few genes are very highly expressed*.

And plotting the variance:

par(mfrow=c(1,2))
plot(density(mock_var))
plot(density(dht_var))
par(mfrow=c(1,1))

*The gene with the highest variance in the DHT treatment samples is ten fold higher than the gene with the highest variance in the mock treatment samples*.

Let's plot mean against variance:

#correlation of mean and variance
cor(mock_mean, mock_var, method="spearman")
#[1] 0.9353861
cor(dht_mean, dht_var, method="spearman")
#[1] 0.9139193
par(mfrow=c(1,2))
plot(log2(mock_mean), log2(mock_var), pch='.')
plot(log2(dht_mean), log2(dht_var), pch='.')
par(mfrow=c(1,1))

*The mean and variance of gene expression is correlated; the higher the expression, the higher the variance*.

How does the mean versus variance plot look for counts that follow a Poisson distribution?

set.seed(31)
sim <- t(sapply(dht_mean, rpois, n=4))
head(sim)
# [,1] [,2] [,3] [,4]
#ENSG00000124208 336 332 333 363
#ENSG00000182463 28 48 28 21
#ENSG00000124201 153 171 171 175
#ENSG00000124205 0 0 0 0
#ENSG00000124207 42 36 50 47
#ENSG00000125835 91 114 138 113
sim_mean <- apply(sim, 1, mean)
sim_var <- apply(sim, 1, var)
cor(sim_mean, sim_var, method="spearman")
#[1] 0.9446539
plot(log2(sim_mean), log2(sim_var), pch='.')

*The mean versus the variance scatter plot of data simulated from a Poisson distribution is more symmetrical than real data*.

Let's calculate the tagwise dispersion of our data using edgeR:

#install edgeR
source("http://bioconductor.org/biocLite.R")
biocLite('edgeR')
#load library
library(edgeR)
d <- data[rowSums(data[,-8])>0,-8]
d <- d[,1:4]
group <- c(rep("mock",4))
d <- DGEList(counts = d, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d, verbose=T)
#Disp = 0.01932 , BCV = 0.139
d <- estimateTagwiseDisp(d)
#the range of values for the tagwise dispersion
summary(d$tagwise.dispersion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0003019 0.0215100 0.0555400 0.1560000 0.2044000 1.2320000
cor(d$AveLogCPM, log2(mock_mean), method="spearman")
[1] 0.9996904
#plotting my log2 mean expression against edgeR's calculation
plot(d$AveLogCPM, log2(mock_mean), pch='.')

*Sanity check between edgeR's average log2 cpm calculations versus our own*.

Let's plot the tagwise dispersion against the average log2 CPM:

plot(d$AveLogCPM, d$tagwise.dispersion, pch='.')

*The higher the expression, the lower the tagwise dispersion*.

Note that we had observed higher variances with genes that were more higher expressed. So tagwise dispersion is not simply the variance.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.6.0 limma_3.20.1 BiocInstaller_1.14.1
loaded via a namespace (and not attached):
[1] tools_3.1.0