Variance in RNA-Seq data

Updated 2014 April 18th

For this post I will use data from this study, that has been nicely summarised already to examine the variance in RNA-Seq data. Briefly, the study used LNCaP cells, which are androgen-sensitive human prostate adenocarcinoma cells, and treated the cells with DHT and with a mock treatment as the control. The poly-A RNAs were captured and sequenced on 7 lanes. Let’s get started:

#I host this file on my server for convenience
file_url <- 'http://davetang.org/eg/pnas_expression.txt'
data <- read.table(file_url, header=T, sep="\t", stringsAsFactors=F, row.names=1)

#check out the first 6 lines of data
head(data)
#                lane1 lane2 lane3 lane4 lane5 lane6 lane8  len
#ENSG00000215696     0     0     0     0     0     0     0  330
#ENSG00000215700     0     0     0     0     0     0     0 2370
#ENSG00000215699     0     0     0     0     0     0     0 1842
#ENSG00000215784     0     0     0     0     0     0     0 2393
#ENSG00000212914     0     0     0     0     0     0     0  384
#ENSG00000212042     0     0     0     0     0     0     0   92

#what are the dimensions of the dataset?
dim(data)
#[1] 37435     8

#as we can see above
#many of the genes have zero expression
#let's filter them out
data_subset <- data[rowSums(data[,-8])>0,]
dim(data_subset)
#[1] 21877     8

At this point, we some expression from at least one lane for 21,877 genes. The first four lanes, i.e., columns 1-4, are the mock treated controls and lanes 5, 6, and 8, i.e., columns 5-7, are the DHT treated cells. The length of the genes is stored as the 8th column.

We can perform a simple counts per million normalisation:

#I'm not going to use the len column
data_subset_cpm <- data_subset[,-8]

#check out the column sums
colSums(data_subset_cpm)
#  lane1   lane2   lane3   lane4   lane5   lane6   lane8 
# 978576 1156844 1442169 1485604 1823460 1834335  681743 

#function for normalisation
norm_cpm <- function(x){
  x * 1000000 / sum(x)
}

#apply the normalisation
data_subset_cpm <- apply(data_subset_cpm, 2, norm_cpm)
head(data_subset_cpm)
#                    lane1     lane2    lane3      lane4     lane5     lane6     lane8
#ENSG00000124208 488.46487 535.07647 435.4552 500.806406 264.88105 390.33219 352.03882
#ENSG00000182463  27.59111  17.28842  18.7218  17.501299  26.32358  29.98362  35.20388
#ENSG00000124201 183.94075 188.44373 203.1662 185.109895 204.55617 164.09216 129.08090
#ENSG00000124205   0.00000   0.00000   3.4670   3.365634   0.00000   0.00000   0.00000
#ENSG00000124207  77.66387  69.15366  58.9390  65.293308  43.87264  44.15769  54.27265
#ENSG00000125835 134.88988 172.88416 138.6800 153.472931 153.55423 111.21197  76.27508

#sanity check
colSums(data_subset_cpm)
#lane1 lane2 lane3 lane4 lane5 lane6 lane8 
#1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06

After normalising simply by counts per million, we can calculate the mean and variance of the gene expression per group:

mock_var <- apply(data_subset_cpm[,1:4], 1, var)
mock_mean <- apply(data_subset_cpm[,1:4], 1, mean)
dht_var <- apply(data_subset_cpm[,5:7], 1, var)
dht_mean <- apply(data_subset_cpm[,5:7], 1, mean)

par(mfrow=c(1,2))
plot(density(mock_mean))
plot(density(dht_mean))

density_meanFew genes are very highly expressed.

And plotting the variance:

par(mfrow=c(1,2))
plot(density(mock_var))
plot(density(dht_var))
par(mfrow=c(1,1))

density_varThe gene with the highest variance in the DHT treatment samples is ten fold higher than the gene with the highest variance in the mock treatment samples.

Let’s plot mean against variance:

#correlation of mean and variance
cor(mock_mean, mock_var, method="spearman")
#[1] 0.9353861
cor(dht_mean, dht_var, method="spearman")
#[1] 0.9139193

par(mfrow=c(1,2))
plot(log2(mock_mean), log2(mock_var), pch='.')
plot(log2(dht_mean), log2(dht_var), pch='.')
par(mfrow=c(1,1))

mean_vs_varThe mean and variance of gene expression is correlated; the higher the expression, the higher the variance.

How does the mean versus variance plot look for counts that follow a Poisson distribution?

set.seed(31)
sim <- t(sapply(dht_mean, rpois, n=4))
head(sim)
#                [,1] [,2] [,3] [,4]
#ENSG00000124208  336  332  333  363
#ENSG00000182463   28   48   28   21
#ENSG00000124201  153  171  171  175
#ENSG00000124205    0    0    0    0
#ENSG00000124207   42   36   50   47
#ENSG00000125835   91  114  138  113

sim_mean <- apply(sim, 1, mean)
sim_var <- apply(sim, 1, var)
cor(sim_mean, sim_var, method="spearman")
#[1] 0.9446539

plot(log2(sim_mean), log2(sim_var), pch='.')

sim_mean_vs_varThe mean versus the variance scatter plot of data simulated from a Poisson distribution is more symmetrical than real data.

Let’s calculate the tagwise dispersion of our data using edgeR:

#install edgeR
source("http://bioconductor.org/biocLite.R")
biocLite('edgeR')
#load library
library(edgeR)

d <- data[rowSums(data[,-8])>0,-8]
d <- d[,1:4]
group <- c(rep("mock",4))
d <- DGEList(counts = d, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d, verbose=T)
#Disp = 0.01932 , BCV = 0.139
d <- estimateTagwiseDisp(d)

#the range of values for the tagwise dispersion
summary(d$tagwise.dispersion)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0003019 0.0215100 0.0555400 0.1560000 0.2044000 1.2320000

cor(d$AveLogCPM, log2(mock_mean), method="spearman")
[1] 0.9996904
#plotting my log2 mean expression against edgeR's calculation
plot(d$AveLogCPM, log2(mock_mean), pch='.')

mock_mean_vs_ave_log_cpmSanity check between edgeR’s average log2 cpm calculations versus our own.

Let’s plot the tagwise dispersion against the average log2 CPM:

plot(d$AveLogCPM, d$tagwise.dispersion, pch='.')

ave_log_cpm_vs_tagwise_dispThe higher the expression, the lower the tagwise dispersion.

Note that we had observed higher variances with genes that were more higher expressed. So tagwise dispersion is not simply the variance.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.6.0          limma_3.20.1         BiocInstaller_1.14.1

loaded via a namespace (and not attached):
[1] tools_3.1.0
Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
2 comments Add yours
  1. Hi,
    Really nice post. It is very helpful.

    You have mentioned that –
    “The mean and variance of gene expression is correlated; the higher the expression, the higher the variance”

    Is this the general trend seen in RNA-Seq data or is it the inference for just this dataset because for my dataset I got mock_cor as 0.7173433 and dht_cor as 0.8350109 ?

    Thanks for any help in advance.

    1. Hi Ashwini,

      the calculation of variance depends on the mean, so when the mean is higher (higher expression), so is the variance; so they should always be correlated.

      To see how variable your replicates are, use the estimateCommonDisp() function from the edgeR package.

      Cheers,

      Dave

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.