Updated 2014 April 18th
For this post I will use data from this study, that has been nicely summarised already to examine the variance in RNA-Seq data. Briefly, the study used LNCaP cells, which are androgen-sensitive human prostate adenocarcinoma cells, and treated the cells with DHT and with a mock treatment as the control. The poly-A RNAs were captured and sequenced on 7 lanes. Let's get started:
#I host this file on my server for convenience file_url <- 'http://davetang.org/eg/pnas_expression.txt' data <- read.table(file_url, header=T, sep="\t", stringsAsFactors=F, row.names=1) #check out the first 6 lines of data head(data) # lane1 lane2 lane3 lane4 lane5 lane6 lane8 len #ENSG00000215696 0 0 0 0 0 0 0 330 #ENSG00000215700 0 0 0 0 0 0 0 2370 #ENSG00000215699 0 0 0 0 0 0 0 1842 #ENSG00000215784 0 0 0 0 0 0 0 2393 #ENSG00000212914 0 0 0 0 0 0 0 384 #ENSG00000212042 0 0 0 0 0 0 0 92 #what are the dimensions of the dataset? dim(data) #[1] 37435 8 #as we can see above #many of the genes have zero expression #let's filter them out data_subset <- data[rowSums(data[,-8])>0,] dim(data_subset) #[1] 21877 8
At this point, we some expression from at least one lane for 21,877 genes. The first four lanes, i.e., columns 1-4, are the mock treated controls and lanes 5, 6, and 8, i.e., columns 5-7, are the DHT treated cells. The length of the genes is stored as the 8th column.
We can perform a simple counts per million normalisation:
#I'm not going to use the len column
data_subset_cpm <- data_subset[,-8]
#check out the column sums
colSums(data_subset_cpm)
# lane1 lane2 lane3 lane4 lane5 lane6 lane8
# 978576 1156844 1442169 1485604 1823460 1834335 681743
#function for normalisation
norm_cpm <- function(x){
x * 1000000 / sum(x)
}
#apply the normalisation
data_subset_cpm <- apply(data_subset_cpm, 2, norm_cpm)
head(data_subset_cpm)
# lane1 lane2 lane3 lane4 lane5 lane6 lane8
#ENSG00000124208 488.46487 535.07647 435.4552 500.806406 264.88105 390.33219 352.03882
#ENSG00000182463 27.59111 17.28842 18.7218 17.501299 26.32358 29.98362 35.20388
#ENSG00000124201 183.94075 188.44373 203.1662 185.109895 204.55617 164.09216 129.08090
#ENSG00000124205 0.00000 0.00000 3.4670 3.365634 0.00000 0.00000 0.00000
#ENSG00000124207 77.66387 69.15366 58.9390 65.293308 43.87264 44.15769 54.27265
#ENSG00000125835 134.88988 172.88416 138.6800 153.472931 153.55423 111.21197 76.27508
#sanity check
colSums(data_subset_cpm)
#lane1 lane2 lane3 lane4 lane5 lane6 lane8
#1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
After normalising simply by counts per million, we can calculate the mean and variance of the gene expression per group:
mock_var <- apply(data_subset_cpm[,1:4], 1, var) mock_mean <- apply(data_subset_cpm[,1:4], 1, mean) dht_var <- apply(data_subset_cpm[,5:7], 1, var) dht_mean <- apply(data_subset_cpm[,5:7], 1, mean) par(mfrow=c(1,2)) plot(density(mock_mean)) plot(density(dht_mean))
Few genes are very highly expressed.
And plotting the variance:
par(mfrow=c(1,2)) plot(density(mock_var)) plot(density(dht_var)) par(mfrow=c(1,1))
The gene with the highest variance in the DHT treatment samples is ten fold higher than the gene with the highest variance in the mock treatment samples.
Let's plot mean against variance:
#correlation of mean and variance cor(mock_mean, mock_var, method="spearman") #[1] 0.9353861 cor(dht_mean, dht_var, method="spearman") #[1] 0.9139193 par(mfrow=c(1,2)) plot(log2(mock_mean), log2(mock_var), pch='.') plot(log2(dht_mean), log2(dht_var), pch='.') par(mfrow=c(1,1))
The mean and variance of gene expression is correlated; the higher the expression, the higher the variance.
How does the mean versus variance plot look for counts that follow a Poisson distribution?
set.seed(31) sim <- t(sapply(dht_mean, rpois, n=4)) head(sim) # [,1] [,2] [,3] [,4] #ENSG00000124208 336 332 333 363 #ENSG00000182463 28 48 28 21 #ENSG00000124201 153 171 171 175 #ENSG00000124205 0 0 0 0 #ENSG00000124207 42 36 50 47 #ENSG00000125835 91 114 138 113 sim_mean <- apply(sim, 1, mean) sim_var <- apply(sim, 1, var) cor(sim_mean, sim_var, method="spearman") #[1] 0.9446539 plot(log2(sim_mean), log2(sim_var), pch='.')
The mean versus the variance scatter plot of data simulated from a Poisson distribution is more symmetrical than real data.
Let's calculate the tagwise dispersion of our data using edgeR:
#install edgeR
source("http://bioconductor.org/biocLite.R")
biocLite('edgeR')
#load library
library(edgeR)
d <- data[rowSums(data[,-8])>0,-8]
d <- d[,1:4]
group <- c(rep("mock",4))
d <- DGEList(counts = d, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d, verbose=T)
#Disp = 0.01932 , BCV = 0.139
d <- estimateTagwiseDisp(d)
#the range of values for the tagwise dispersion
summary(d$tagwise.dispersion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0003019 0.0215100 0.0555400 0.1560000 0.2044000 1.2320000
cor(d$AveLogCPM, log2(mock_mean), method="spearman")
[1] 0.9996904
#plotting my log2 mean expression against edgeR's calculation
plot(d$AveLogCPM, log2(mock_mean), pch='.')
Sanity check between edgeR's average log2 cpm calculations versus our own.
Let's plot the tagwise dispersion against the average log2 CPM:
plot(d$AveLogCPM, d$tagwise.dispersion, pch='.')
The higher the expression, the lower the tagwise dispersion.
Note that we had observed higher variances with genes that were more higher expressed. So tagwise dispersion is not simply the variance.
sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 [4] LC_NUMERIC=C LC_TIME=English_Australia.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.6.0 limma_3.20.1 BiocInstaller_1.14.1 loaded via a namespace (and not attached): [1] tools_3.1.0

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi,
Really nice post. It is very helpful.
You have mentioned that –
“The mean and variance of gene expression is correlated; the higher the expression, the higher the variance”
Is this the general trend seen in RNA-Seq data or is it the inference for just this dataset because for my dataset I got mock_cor as 0.7173433 and dht_cor as 0.8350109 ?
Thanks for any help in advance.
Hi Ashwini,
the calculation of variance depends on the mean, so when the mean is higher (higher expression), so is the variance; so they should always be correlated.
To see how variable your replicates are, use the estimateCommonDisp() function from the edgeR package.
Cheers,
Dave