Creating UCSC Genome Browser custom tracks with links

An extremely useful feature of the UCSC Genome Browser, one that I have been using for many years, is the ability to create links from the genomic features in your custom track. For more information, see this page, step 5.

For example, suppose I make a BED file of SNPs and load it as a custom track, and when it is displayed on the Genome Browser I want clicking on a feature to take me directly to the dbSNP website. All you need to do is add this line to your custom BED file:

track name="dbSNPs" description="SNPs from dbSNP" url="$$"

Then when you click on the SNP, you should see a page with an "Outside Link".
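For a concrete picture, here is a minimal sketch of what such a custom track file might look like (the coordinates and rs numbers are made up, and the dbSNP URL pattern is my assumption, not from the original post). The Genome Browser substitutes $$ in the url with the name field of the clicked feature:

```
track name="dbSNPs" description="SNPs from dbSNP" url="https://www.ncbi.nlm.nih.gov/snp/$$"
chr1    10000   10001   rs12345
chr1    20000   20001   rs67890
```

With this track loaded, clicking rs12345 would follow the url with $$ replaced by rs12345.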

Learning to use biomaRt

In the past I've been manually downloading tables of data annotation and parsing them with Perl. I guess it's time to do things more elegantly. Below is code taken from the biomaRt vignette:

If you are using Ubuntu and get a "Cannot find xml2-config" error while installing XML, a prerequisite of biomaRt, try installing libxml2-dev:

sudo apt-get install libxml2-dev


Making a barplot in R

Just a short post on making a barplot in R after reading in data via the read.table() function. I created a file with two rows, the first row containing the header and the second row containing the data values.

a       b       c       d       e
10      20      30      20      10

x <- read.table("some.file", header=TRUE)
#the data needs to be in matrix format for barplot()
x <- as.matrix(x)
#to label the x-axis
barplot(x, names.arg=colnames(x))
#to label the x-axis using horizontal labels
barplot(x, names.arg=colnames(x), las=1)
#export as postscript
postscript("barplot.ps")
barplot(x, names.arg=colnames(x), las=1); dev.off()

Horizontal barplot sorted by values

x <- matrix(sample(100,10), ncol=10, nrow=1)
colnames(x) <- c('a','b','c','d','e','f','g','h','i','j')
x
#      a  b  c  d  e  f  g  h  i  j
# [1,] 33 92 99 15 53 24 62 27 58 72

#sort by value; subsetting a single row drops the matrix to a named vector
x <- x[,order(x)]
x
#  d  f  h  a  e  i  g  j  b  c 
# 15 24 27 33 53 58 62 72 92 99
barplot(x, horiz=TRUE, las=1)


Comparing different distributions

I recently learned of the Kolmogorov-Smirnov test and how one can use it to check whether two datasets are likely to come from different distributions. Strictly speaking, the p-value is the probability of observing a test statistic at least as extreme as the one we observed, under the null hypothesis that the two datasets come from the same distribution; a small p-value lets us reject that null. Using R:

#example 1
x <- rpois(n=1000, lambda=100)
y <- rpois(n=1000, lambda=100)
ks.test(x, y)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.036, p-value = 0.5361
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : p-value will be approximate in the presence of ties

#example 2
#how about having a large population
#and drawing different samples for comparison
pop <- rpois(n=1000000, lambda=1000)
summary(pop)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    848     979    1000    1000    1021    1165

#sample 1
sample_1 <- sample(x=pop, size=1000)
#add some noise/jitter
sample_1 <- jitter(sample_1, factor=10)

sample_2 <- sample(x=pop, size=1000)
#add some noise/jitter
sample_2 <- jitter(sample_2, factor=10)

ks.test(sample_1, sample_2)
	Two-sample Kolmogorov-Smirnov test

data:  sample_1 and sample_2
D = 0.026, p-value = 0.8879
alternative hypothesis: two-sided

In both examples, we cannot reject the null hypothesis that the two samples come from the same distribution. This makes sense: in the first example both samples were drawn from a Poisson distribution with the same lambda, and in the second example both samples were drawn from the same population.

x <- rpois(n=1000, lambda=100)
y <- rnorm(n=1000, mean=100)
ks.test(x, y)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.433, p-value < 2.2e-16
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : p-value will be approximate in the presence of ties

In this case we get an extremely low p-value, so we can reject the null hypothesis that the two samples come from the same distribution; indeed they do not (one sample is from a normal distribution and the other from a Poisson distribution).

The warning messages are due to the implementation of the KS test in R, which expects continuous distributions, under which there should not be any identical values in the two datasets, i.e. ties. I've read several sources that all mention the KS test can deal with both discrete and continuous data (presumably because it works on cumulative distributions), but I'm not sure about the implementation in R.
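To see the ties issue concretely, here is a small sketch of my own (not from the original post): Poisson counts are discrete, so two samples of 1,000 draws are guaranteed to contain tied values, and jittering them, as was done with the samples above, makes every value unique so that ks.test() no longer warns.

```r
#Poisson counts around lambda=100 contain many tied values
set.seed(31)
x <- rpois(n=1000, lambda=100)
y <- rpois(n=1000, lambda=100)
any(duplicated(c(x, y)))
#jitter() adds a small amount of uniform noise,
#making every value unique and removing the ties
x_jit <- jitter(x)
y_jit <- jitter(y)
any(duplicated(c(x_jit, y_jit)))
#the test now runs without the ties warning
ks.test(x_jit, y_jit)
```

Whether jittering is appropriate depends on the data, since it perturbs the values slightly, but for a rough comparison of distributions it is a common workaround.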


Testing for normality

On a related topic, how can we check whether our dataset is normally distributed? One way is a QQ-normal plot: if the data are normal, the points lie on a straight diagonal line.

norm_data <- rnorm(n=1000, mean=200, sd=50)
qqnorm(norm_data)
qqline(norm_data)

gamma_data <- rgamma(n=1000, shape=1)
qqnorm(gamma_data)
qqline(gamma_data)
What about some statistical test? Let's try the Shapiro-Wilk test, which tests the null hypothesis that a sample came from a normally distributed population.

#we can't reject the null, since the data really came from a normally distributed population
shapiro.test(norm_data)

	Shapiro-Wilk normality test

data:  norm_data
W = 0.9984, p-value = 0.4765
#we can reject the null, since the values were drawn from a gamma distribution
shapiro.test(gamma_data)

	Shapiro-Wilk normality test

data:  gamma_data
W = 0.8146, p-value < 2.2e-16
#ks.test also rejects the null that both samples come from the same distribution
ks.test(norm_data, gamma_data)

	Two-sample Kolmogorov-Smirnov test

data:  norm_data and gamma_data
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided

See this useful discussion on StackExchange on testing for normality.

I've joined Twitter

Today while reading a paper, I found some interesting one-liner facts. They are way too short to create a post on but I would like to make a repository of them. What better place to store these facts than Twitter!

You can follow me on Twitter for a list of facts on molecular biology and on bioinformatics that I didn't know or have forgotten about over the years.

Update 7th April 2013

I've been using Twitter for almost a year now and have found better uses for it than what I had originally planned. I've learned of interesting papers, software, and entertaining bits and pieces from the people I follow. I've started to play around with the R twitteR package and started reading a book on Twitter called The Tao of Twitter. It's quite business-oriented, but I'm enjoying the read.

The main concept of the book is describing the Tao, which refers to these three concepts:

  1. Targeted connections
  2. Meaningful content
  3. Authentic helpfulness

whereby you can get the most out of Twitter if you network with the right people, share information that is useful to others, and are perceived as genuine when it comes to helping others.

Variance in RNA-Seq data

Updated 2014 April 18th

For this post I will use data from this study, which has already been nicely summarised, to examine the variance in RNA-Seq data. Briefly, the study used LNCaP cells, which are androgen-sensitive human prostate adenocarcinoma cells, and treated the cells with DHT, with a mock treatment as the control. The poly-A RNAs were captured and sequenced on 7 lanes. Let's get started:

#I host this file on my server for convenience
file_url <- ''
data <- read.table(file_url, header=T, sep="\t", stringsAsFactors=F, row.names=1)

#check out the first 6 lines of data
head(data)
#                lane1 lane2 lane3 lane4 lane5 lane6 lane8  len
#ENSG00000215696     0     0     0     0     0     0     0  330
#ENSG00000215700     0     0     0     0     0     0     0 2370
#ENSG00000215699     0     0     0     0     0     0     0 1842
#ENSG00000215784     0     0     0     0     0     0     0 2393
#ENSG00000212914     0     0     0     0     0     0     0  384
#ENSG00000212042     0     0     0     0     0     0     0   92

#what are the dimensions of the dataset?
dim(data)
#[1] 37435     8

#as we can see above
#many of the genes have zero expression
#let's filter them out
data_subset <- data[rowSums(data[,-8])>0,]
dim(data_subset)
#[1] 21877     8

At this point, we have expression in at least one lane for 21,877 genes. The first four lanes, i.e., columns 1-4, are the mock-treated controls, and lanes 5, 6, and 8, i.e., columns 5-7, are the DHT-treated cells. The gene lengths are stored in the 8th column.

We can perform a simple counts per million normalisation:

#I'm not going to use the len column
data_subset_cpm <- data_subset[,-8]

#check out the column sums
colSums(data_subset_cpm)
#  lane1   lane2   lane3   lane4   lane5   lane6   lane8 
# 978576 1156844 1442169 1485604 1823460 1834335  681743 

#function for normalisation
norm_cpm <- function(x){
  x * 1000000 / sum(x)
}

#apply the normalisation
data_subset_cpm <- apply(data_subset_cpm, 2, norm_cpm)
head(data_subset_cpm)
#                    lane1     lane2    lane3      lane4     lane5     lane6     lane8
#ENSG00000124208 488.46487 535.07647 435.4552 500.806406 264.88105 390.33219 352.03882
#ENSG00000182463  27.59111  17.28842  18.7218  17.501299  26.32358  29.98362  35.20388
#ENSG00000124201 183.94075 188.44373 203.1662 185.109895 204.55617 164.09216 129.08090
#ENSG00000124205   0.00000   0.00000   3.4670   3.365634   0.00000   0.00000   0.00000
#ENSG00000124207  77.66387  69.15366  58.9390  65.293308  43.87264  44.15769  54.27265
#ENSG00000125835 134.88988 172.88416 138.6800 153.472931 153.55423 111.21197  76.27508

#sanity check
colSums(data_subset_cpm)
#lane1 lane2 lane3 lane4 lane5 lane6 lane8 
#1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06

After normalising simply by counts per million, we can calculate the mean and variance of the gene expression per group:

mock_var <- apply(data_subset_cpm[,1:4], 1, var)
mock_mean <- apply(data_subset_cpm[,1:4], 1, mean)
dht_var <- apply(data_subset_cpm[,5:7], 1, var)
dht_mean <- apply(data_subset_cpm[,5:7], 1, mean)


[Figure: density_mean] Few genes are very highly expressed.

And plotting the variance:


[Figure: density_var] The gene with the highest variance in the DHT-treated samples is ten-fold higher than the gene with the highest variance in the mock-treated samples.

Let's plot mean against variance:

#correlation of mean and variance
cor(mock_mean, mock_var, method="spearman")
#[1] 0.9353861
cor(dht_mean, dht_var, method="spearman")
#[1] 0.9139193

plot(log2(mock_mean), log2(mock_var), pch='.')
plot(log2(dht_mean), log2(dht_var), pch='.')

[Figure: mean_vs_var] The mean and variance of gene expression are correlated; the higher the expression, the higher the variance.

How does the mean versus variance plot look for counts that follow a Poisson distribution?

sim <- t(sapply(dht_mean, rpois, n=4))
head(sim)
#                [,1] [,2] [,3] [,4]
#ENSG00000124208  336  332  333  363
#ENSG00000182463   28   48   28   21
#ENSG00000124201  153  171  171  175
#ENSG00000124205    0    0    0    0
#ENSG00000124207   42   36   50   47
#ENSG00000125835   91  114  138  113

sim_mean <- apply(sim, 1, mean)
sim_var <- apply(sim, 1, var)
cor(sim_mean, sim_var, method="spearman")
#[1] 0.9446539

plot(log2(sim_mean), log2(sim_var), pch='.')

[Figure: sim_mean_vs_var] The mean-versus-variance scatter plot of data simulated from a Poisson distribution is more symmetrical than that of the real data.

Let's calculate the tagwise dispersion of our data using edgeR:

#install edgeR
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
#load library
library(edgeR)

d <- data[rowSums(data[,-8])>0,-8]
d <- d[,1:4]
group <- c(rep("mock",4))
d <- DGEList(counts = d, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d, verbose=T)
#Disp = 0.01932 , BCV = 0.139
d <- estimateTagwiseDisp(d)

#the range of values for the tagwise dispersion
summary(d$tagwise.dispersion)
#     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#0.0003019 0.0215100 0.0555400 0.1560000 0.2044000 1.2320000

cor(d$AveLogCPM, log2(mock_mean), method="spearman")
#[1] 0.9996904
#plotting my log2 mean expression against edgeR's calculation
plot(d$AveLogCPM, log2(mock_mean), pch='.')

[Figure: mock_mean_vs_ave_log_cpm] Sanity check of edgeR's average log2 CPM calculation against our own.

Let's plot the tagwise dispersion against the average log2 CPM:

plot(d$AveLogCPM, d$tagwise.dispersion, pch='.')

[Figure: ave_log_cpm_vs_tagwise_disp] The higher the expression, the lower the tagwise dispersion.

Note that we had observed higher variances for genes that were more highly expressed, so the tagwise dispersion is not simply the variance.
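The edgeR output above (Disp = 0.01932, BCV = 0.139) hints at what the dispersion is: edgeR models counts as negative binomial, where variance = mean + dispersion × mean², and the BCV is the square root of the dispersion. Here is a small simulation sketch of my own (the dispersion value is made up, and this does not use the LNCaP data) showing that the variance still grows with the mean even when the dispersion is held fixed:

```r
#negative binomial model used by edgeR: variance = mu + phi * mu^2
#note: in rnbinom() the size parameter is 1/dispersion
phi <- 0.2                                  #hypothetical dispersion (BCV^2)
set.seed(31)
counts_lo <- rnbinom(n=100000, mu=100, size=1/phi)
counts_hi <- rnbinom(n=100000, mu=1000, size=1/phi)
mean(counts_lo)   #close to 100
var(counts_lo)    #close to 100 + 0.2*100^2 = 2100
var(counts_hi)    #close to 1000 + 0.2*1000^2 = 201000
```

So two genes can share the same dispersion while having very different variances, which is why the tagwise dispersion plot does not simply mirror the variance plot.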

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.6.0          limma_3.20.1         BiocInstaller_1.14.1

loaded via a namespace (and not attached):
[1] tools_3.1.0

Creating a matrix of scatter plots in R

Scatter plots are two-dimensional plots that show the relationship between two variables. Here I demonstrate how to create a matrix of scatter plots in R for datasets with more than two variables, which is particularly useful when we want to visually inspect whether there are associations between variables.

#store random set of numbers in four variables
a <- runif(10,0,100)
b <- runif(10,0,100)
c <- runif(10,0,100)
d <- runif(10,0,100)
#create one big data.frame
data <- data.frame(a,b,c,d)
data
#           a          b        c         d
#1   81.47277 38.6969373 64.10036 83.194996
#2   18.99613  9.0549361 37.18465 79.644227
#3   22.16861  2.4355091 24.33679 77.087804
#4   64.18384  0.3513288 22.03435 39.354341
#5   57.88510 71.0003202 37.55931 45.650394
#6   21.13935 93.0838322 44.20160 37.522434
#7   35.74084 65.3453410 15.43229 28.034387
#8   25.55078 23.3961467 82.24938 58.981237
#9   57.99774 82.5590614 64.13307 49.083873
#10   5.55767 35.8025412 92.86938  5.454501
#viewing all pairs of scatterplots
pairs(~a+b+c+d, data=data)

To display correlations on the lower panel (since the plots are redundant anyway):

#store random set of numbers in four variables
a <- runif(10,0,100)
b <- runif(10,0,100)
c <- runif(10,0,100)
d <- runif(10,0,100)
#create one big data.frame
data <- data.frame(a,b,c,d)
#define panel.cor function
#code adapted from the R pairs man page
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- cor(x, y)
  txt <- format(c(r, 0.123456789), digits=digits)[1]
  txt <- paste(prefix, txt, sep="")
  text(0.5, 0.5, txt, cex=2)
}
pairs(data, lower.panel=panel.cor)