Updated 2014 April 18th

For this post I will use data from this study, that has been nicely summarised already to examine the variance in RNA-Seq data. Briefly, the study used LNCaP cells, which are androgen-sensitive human prostate adenocarcinoma cells, and treated the cells with DHT and with a mock treatment as the control. The poly-A RNAs were captured and sequenced on 7 lanes. Let’s get started:

#I host this file on my server for convenience file_url <- 'http://davetang.org/eg/pnas_expression.txt' data <- read.table(file_url, header=T, sep="\t", stringsAsFactors=F, row.names=1) #check out the first 6 lines of data head(data) # lane1 lane2 lane3 lane4 lane5 lane6 lane8 len #ENSG00000215696 0 0 0 0 0 0 0 330 #ENSG00000215700 0 0 0 0 0 0 0 2370 #ENSG00000215699 0 0 0 0 0 0 0 1842 #ENSG00000215784 0 0 0 0 0 0 0 2393 #ENSG00000212914 0 0 0 0 0 0 0 384 #ENSG00000212042 0 0 0 0 0 0 0 92 #what are the dimensions of the dataset? dim(data) #[1] 37435 8 #as we can see above #many of the genes have zero expression #let's filter them out data_subset <- data[rowSums(data[,-8])>0,] dim(data_subset) #[1] 21877 8

At this point, we some expression from at least one lane for 21,877 genes. The first four lanes, i.e., columns 1-4, are the mock treated controls and lanes 5, 6, and 8, i.e., columns 5-7, are the DHT treated cells. The length of the genes is stored as the 8th column.

We can perform a simple counts per million normalisation:

#I'm not going to use the len column data_subset_cpm <- data_subset[,-8] #check out the column sums colSums(data_subset_cpm) # lane1 lane2 lane3 lane4 lane5 lane6 lane8 # 978576 1156844 1442169 1485604 1823460 1834335 681743 #function for normalisation norm_cpm <- function(x){ x * 1000000 / sum(x) } #apply the normalisation data_subset_cpm <- apply(data_subset_cpm, 2, norm_cpm) head(data_subset_cpm) # lane1 lane2 lane3 lane4 lane5 lane6 lane8 #ENSG00000124208 488.46487 535.07647 435.4552 500.806406 264.88105 390.33219 352.03882 #ENSG00000182463 27.59111 17.28842 18.7218 17.501299 26.32358 29.98362 35.20388 #ENSG00000124201 183.94075 188.44373 203.1662 185.109895 204.55617 164.09216 129.08090 #ENSG00000124205 0.00000 0.00000 3.4670 3.365634 0.00000 0.00000 0.00000 #ENSG00000124207 77.66387 69.15366 58.9390 65.293308 43.87264 44.15769 54.27265 #ENSG00000125835 134.88988 172.88416 138.6800 153.472931 153.55423 111.21197 76.27508 #sanity check colSums(data_subset_cpm) #lane1 lane2 lane3 lane4 lane5 lane6 lane8 #1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06

After normalising simply by counts per million, we can calculate the mean and variance of the gene expression per group:

mock_var <- apply(data_subset_cpm[,1:4], 1, var) mock_mean <- apply(data_subset_cpm[,1:4], 1, mean) dht_var <- apply(data_subset_cpm[,5:7], 1, var) dht_mean <- apply(data_subset_cpm[,5:7], 1, mean) par(mfrow=c(1,2)) plot(density(mock_mean)) plot(density(dht_mean))

*Few genes are very highly expressed*.

And plotting the variance:

par(mfrow=c(1,2)) plot(density(mock_var)) plot(density(dht_var)) par(mfrow=c(1,1))

*The gene with the highest variance in the DHT treatment samples is ten fold higher than the gene with the highest variance in the mock treatment samples*.

Let’s plot mean against variance:

#correlation of mean and variance cor(mock_mean, mock_var, method="spearman") #[1] 0.9353861 cor(dht_mean, dht_var, method="spearman") #[1] 0.9139193 par(mfrow=c(1,2)) plot(log2(mock_mean), log2(mock_var), pch='.') plot(log2(dht_mean), log2(dht_var), pch='.') par(mfrow=c(1,1))

*The mean and variance of gene expression is correlated; the higher the expression, the higher the variance*.

How does the mean versus variance plot look for counts that follow a Poisson distribution?

set.seed(31) sim <- t(sapply(dht_mean, rpois, n=4)) head(sim) # [,1] [,2] [,3] [,4] #ENSG00000124208 336 332 333 363 #ENSG00000182463 28 48 28 21 #ENSG00000124201 153 171 171 175 #ENSG00000124205 0 0 0 0 #ENSG00000124207 42 36 50 47 #ENSG00000125835 91 114 138 113 sim_mean <- apply(sim, 1, mean) sim_var <- apply(sim, 1, var) cor(sim_mean, sim_var, method="spearman") #[1] 0.9446539 plot(log2(sim_mean), log2(sim_var), pch='.')

*The mean versus the variance scatter plot of data simulated from a Poisson distribution is more symmetrical than real data*.

Let’s calculate the tagwise dispersion of our data using edgeR:

#install edgeR source("http://bioconductor.org/biocLite.R") biocLite('edgeR') #load library library(edgeR) d <- data[rowSums(data[,-8])>0,-8] d <- d[,1:4] group <- c(rep("mock",4)) d <- DGEList(counts = d, group=group) d <- calcNormFactors(d) d <- estimateCommonDisp(d, verbose=T) #Disp = 0.01932 , BCV = 0.139 d <- estimateTagwiseDisp(d) #the range of values for the tagwise dispersion summary(d$tagwise.dispersion) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0003019 0.0215100 0.0555400 0.1560000 0.2044000 1.2320000 cor(d$AveLogCPM, log2(mock_mean), method="spearman") [1] 0.9996904 #plotting my log2 mean expression against edgeR's calculation plot(d$AveLogCPM, log2(mock_mean), pch='.')

*Sanity check between edgeR’s average log2 cpm calculations versus our own*.

Let’s plot the tagwise dispersion against the average log2 CPM:

plot(d$AveLogCPM, d$tagwise.dispersion, pch='.')

*The higher the expression, the lower the tagwise dispersion*.

Note that we had observed higher variances with genes that were more higher expressed. So tagwise dispersion is not simply the variance.

sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 [4] LC_NUMERIC=C LC_TIME=English_Australia.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.6.0 limma_3.20.1 BiocInstaller_1.14.1 loaded via a namespace (and not attached): [1] tools_3.1.0

This work is licensed under a Creative Commons

Attribution 4.0 International License.

Hi,

Really nice post. It is very helpful.

You have mentioned that –

“The mean and variance of gene expression is correlated; the higher the expression, the higher the variance”

Is this the general trend seen in RNA-Seq data or is it the inference for just this dataset because for my dataset I got mock_cor as 0.7173433 and dht_cor as 0.8350109 ?

Thanks for any help in advance.

Hi Ashwini,

the calculation of variance depends on the mean, so when the mean is higher (higher expression), so is the variance; so they should always be correlated.

To see how variable your replicates are, use the estimateCommonDisp() function from the edgeR package.

Cheers,

Dave