# Using the GenometriCorr package

I was reading through the bedtools jaccard documentation when I saw the reference "Exploring Massive, Genome Scale Datasets with the GenometriCorr Package". Firstly, for those wondering what the Jaccard index is, it's a simple metric defined as:

J(A, B) = |A ∩ B| / |A ∪ B|

The numerator is the size of the intersection between A and B, and the denominator is the size of the union of A and B. If you want a metric that summarises the amount of intersection between two sets of features, check out bedtools jaccard. Back to the topic: I decided to check out the paper, as the title attracted me and I wanted to learn more about the GenometriCorr package. It turns out that it's an R package.
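To make the definition concrete, here is a minimal sketch (in Python rather than R, purely for illustration; the two sets of positions are made up):

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# two made-up sets of covered base positions
peaks_a = {100, 101, 102, 103, 104}
peaks_b = {103, 104, 105, 106}

print(jaccard(peaks_a, peaks_b))  # 2 shared / 7 total = 0.2857...
```

bedtools jaccard does the interval arithmetic on BED files for you, but the ratio it reports has exactly this form.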

# Pearson vs. Spearman correlation

Correlation measures are commonly used to show how correlated two datasets are. A commonly used measure is the Pearson correlation. To illustrate when not to use it:

x = c(55,70,33,100,99,15,2,1,5,2000)
y = c(2,10,88,20,30,88,23,49,40,2000)

cor(x,y,method="pearson")
[1] 0.9957008
cor(x,y,method="spearman")
[1] -0.07294867
cor(log(x),log(y),method="pearson")
[1] 0.3556905

summary(x)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.00    7.50   44.00  238.00   91.75 2000.00
summary(y)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
2.00   20.75   35.00  235.00   78.25 2000.00

sd(x)
[1] 620.2714
sd(y)
[1] 620.8571


If we remove the 2,000 value:

x1 = c(55,70,33,100,99,15,2,1,5)
y1 = c(2,10,88,20,30,88,23,49,40)
cor(x1,y1,method="pearson")
[1] -0.4440288
cor(x1,y1,method="spearman")
[1] -0.4769916


Use a non-parametric correlation measure (e.g. Spearman's rank) if your dataset contains outliers. Better still, remove the outlier: the negative correlation is revealed even more clearly by Spearman's rank afterwards.
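Spearman's rank correlation is simply the Pearson correlation computed on the ranks of the values, which is why a single extreme value carries no extra weight. A pure-Python sketch (illustrative only) on the same data as above:

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(x):
    """1-based ranks, averaging ties, matching R's default behaviour."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman = Pearson on the ranks
    return pearson(ranks(x), ranks(y))

x = [55, 70, 33, 100, 99, 15, 2, 1, 5, 2000]
y = [2, 10, 88, 20, 30, 88, 23, 49, 40, 2000]
print(pearson(x, y))   # ~0.996, dominated by the (2000, 2000) point
print(spearman(x, y))  # ~-0.073, the outlier is just another rank
```

The (2000, 2000) pair contributes an enormous cross-product to Pearson's numerator, but after ranking it is merely the 10th-largest value in each list.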

# DESeq vs. edgeR vs. baySeq

6th April 2012: for a more up-to-date version of this post, please see this post.

### A very simple comparison

I used the TagSeqExample.tab file from the DESeq package as the benchmark dataset. According to the DESeq authors, T1a and T1b are similar, so I removed the second column of the file, which corresponds to T1a:

cut --complement -f2 TagSeqExample.tab > tag_seq_example_less_t1a.tsv


### Hierarchical clustering

To get an idea of how similar the libraries are, I performed hierarchical clustering using the Spearman correlation between libraries (see here). Note that these raw counts have not been adjusted for sequencing depth, but by using a rank-based correlation method we can hopefully bypass that step.
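The hope rests on ranks being invariant under library scaling: multiplying every count in a library by a constant leaves its within-library ranks, and hence any Spearman correlation involving it, unchanged. A tiny check (in Python, for illustration only):

```python
counts = [5, 1, 9, 3, 7, 7]

def rank_order(v):
    """Return the ordering of indices by value (ties keep index order)."""
    return sorted(range(len(v)), key=lambda i: v[i])

scaled = [3 * c for c in counts]  # e.g. three times the sequencing depth
print(rank_order(counts) == rank_order(scaled))  # True: ranks unchanged
```

Scaling is only the simplest depth effect, of course; it does not model, say, genes saturating at high depth.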

From the dendrogram, we observe a marked difference between the normal samples and the tumour samples.

### Installing the R packages

I then installed all the required packages (R version 2.12.0, biocinstall version 2.7.4):

source("http://www.bioconductor.org/biocLite.R")
biocLite("baySeq")
biocLite("DESeq")
biocLite("edgeR")


### DESeq

DESeq code adapted from the vignette:

library(DESeq)
countsTable <- read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, stringsAsFactors=TRUE)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[,-1]
head(countsTable)
conds = c("T","T","T","N","N")
cds <- newCountDataSet(countsTable, conds)
cds <- estimateSizeFactors(cds)
sizeFactors(cds)
      T1b        T2        T3        N1        N2
0.5587394 1.5823096 1.1270425 1.2869337 0.8746998
cds <- estimateVarianceFunctions(cds)
res <- nbinomTest(cds, "T", "N")
resSig <- res[ res$padj < 0.1, ]
nrow(resSig)
[1] 1072
write.csv(resSig,file="deseq_res_sig.csv")


### edgeR

First transform the tag_seq_example_less_t1a.tsv file to the edgeR format using this script.

tsv_to_edger.pl tag_seq_example_less_t1a.tsv

library(edgeR)
targets = read.delim(file = "targets.txt", stringsAsFactors = FALSE)
d <- readDGE(targets) #build a DGEList object from the targets file
d$samples
  files   group description lib.size norm.factors
1   T1b patient     patient  2399311            1
2    T2 patient     patient  7202770            1
3    T3 patient     patient  5856461            1
4    N1 control     control  6376307            1
5    N2 control     control  3931277            1

Recently edgeR added a normalisation function called calcNormFactors() to the package. For more information see this paper.

d = calcNormFactors(d)
d$samples
  files   group description lib.size norm.factors
1   T1b patient     patient  2399311    1.1167157
2    T2 patient     patient  7202770    0.9978323
3    T3 patient     patient  5856461    0.9170402
4    N1 control     control  6376307    0.9546010
5    N2 control     control  3931277    1.0251550

This approach is different from the one taken by DESeq.
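DESeq instead derives a per-sample size factor: for each gene it builds a pseudo-reference (the geometric mean across samples), and each sample's factor is the median of its count-to-reference ratios. A rough sketch of that median-of-ratios idea (in Python for illustration, on a made-up two-sample count matrix):

```python
import math

def size_factors(counts):
    """counts: list of samples, each a list of per-gene counts."""
    n_genes = len(counts[0])
    # pseudo-reference: geometric mean of each gene across samples
    # (genes with a zero count in any sample are skipped, as in DESeq)
    ref = []
    for g in range(n_genes):
        vals = [s[g] for s in counts]
        if all(v > 0 for v in vals):
            ref.append(math.exp(sum(math.log(v) for v in vals) / len(vals)))
        else:
            ref.append(None)
    factors = []
    for s in counts:
        ratios = sorted(s[g] / ref[g] for g in range(n_genes) if ref[g] is not None)
        mid = len(ratios) // 2
        median = ratios[mid] if len(ratios) % 2 else (ratios[mid - 1] + ratios[mid]) / 2
        factors.append(median)
    return factors

# two made-up samples; sample 2 was sequenced at twice the depth
sample1 = [10, 20, 30, 0]
sample2 = [20, 40, 60, 5]
print(size_factors([sample1, sample2]))  # sample 2's factor is twice sample 1's
```

The median makes the estimate robust: a handful of genuinely differentially expressed genes shift only the tails of the ratio distribution, not its middle.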

d = estimateCommonDisp(d)
de.com = exactTest(d)
options(digits=4)
topTags(de.com)
detags.com = rownames(topTags(de.com)$table)
sum(de.com$table$p.value < 0.01)
[1] 895
sum(p.adjust(de.com$table$p.value, method="BH") < 0.1)
[1] 593
good = sum(p.adjust(de.com$table$p.value, method="BH") < 0.1)
goodList = topTags(de.com, n=good)
sink("edger.txt")
goodList
sink()

### baySeq

library(baySeq)
all = read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, sep="\t")
lib <- c(2399311,7202770,5856461,6376307,3931277)
replicates <- c(1,1,1,2,2)
groups <- list(NDE = c(1,1,1,1,1), DE = c(1,1,1,2,2))
cname <- all[,1]
all <- all[,-1]
all = as.matrix(all)
CD <- new("countData", data = all, replicates = replicates, libsizes = as.integer(lib), groups = groups)
CD@annotation <- as.data.frame(cname)
cl <- NULL
CDP.NBML <- getPriors.NB(CD, samplesize = 1000, estimation = "QL", cl = cl)
CDPost.NBML <- getLikelihoods.NB(CDP.NBML, pET = 'BIC', cl = cl)
CDPost.NBML@estProps
[1] 0.8897621 0.1102379
topCounts(CDPost.NBML, group=2)
NBML.TPs <- getTPs(CDPost.NBML, group=2, TPs = 1:100)
bayseq_de = topCounts(CDPost.NBML, group=2, number=2000)
write.csv(bayseq_de, file="bayseq_de.csv")

### Overlap

R package | Number of differentially expressed genes
--- | ---
DESeq | 888
edgeR | 895
baySeq | 1,115
Total | 18,760

How to make a Venn diagram

Overlap | Number of genes
--- | ---
DESeq with edgeR | 591
DESeq with baySeq | 488
edgeR with baySeq | 465
DESeq with edgeR with baySeq | 338

This is a first draft post and will be continually updated in the future, but I just thought I'd publish it first.

# Calculating Pearson correlation using Perl

My modification of code originally available here; the original code is probably easier to understand. I altered the code so that I could use an anonymous 2D array and run under strictures, letting me plug it into my own code. Comments are included in the code below to assist use.

#!/usr/bin/perl

use strict;
use warnings;

#make up some matrix for demonstration purposes
my $x = [];
#generate 10 rows
for (my $i = 1; $i < 11; ++$i){
   #generate 2 columns, where column 1 = $i and column 2 = $i
   for (my $j = 1; $j < 3; ++$j){
      $x->[$i][$j] = $i;
   }
}
#end matrix code

my $correlation = correlation($x);
#Since the values in the columns are identical in $x, the correlation will be 1
print "$correlation\n";

#to use this code, remove the matrix code above
#generate an anonymous 2D array where $x->[1] is the row
#$x->[1][1] is the value in row 1 column 1 and $x->[1][2] is the value of row 1 column 2
#once you build the entire array, pass it to the correlation subroutine as above
#my $correlation = correlation($x);
#if you want to see what's inside $x use the code below
#for (my $i = 1; $i < scalar(@{$x}); ++$i){
#   my $line_to_print = '';
#   for (my $j = 1; $j < scalar(@{$x->[$i]}); ++$j){
#      $line_to_print .= "$x->[$i]->[$j]\t";
#   }
#   $line_to_print =~ s/\t$//;
#   print "$line_to_print\n";
#}

#returns the mean of column 1 and column 2; row 0 is unused
sub mean {
   my ($x) = @_;
   my $num = scalar(@{$x}) - 1;
   my $sum_x = 0;
   my $sum_y = 0;
   for (my $i = 1; $i < scalar(@{$x}); ++$i){
      $sum_x += $x->[$i][1];
      $sum_y += $x->[$i][2];
   }
   my $mu_x = $sum_x / $num;
   my $mu_y = $sum_y / $num;
   return($mu_x, $mu_y);
}

### ss = sum of squared deviations to the mean
sub ss {
   my ($x, $mean_x, $mean_y, $one, $two) = @_;
   my $sum = 0;
   for (my $i = 1; $i < scalar(@{$x}); ++$i){
      $sum += ($x->[$i][$one] - $mean_x) * ($x->[$i][$two] - $mean_y);
   }
   return $sum;
}

sub correlation {
   my ($x) = @_;
   my ($mean_x, $mean_y) = mean($x);
   #for ssxx and ssyy, pass the mean of the relevant column twice,
   #since both deviations in ss() are taken from the same column
   my $ssxx = ss($x, $mean_x, $mean_x, 1, 1);
   my $ssyy = ss($x, $mean_y, $mean_y, 2, 2);
   my $ssxy = ss($x, $mean_x, $mean_y, 1, 2);
   my $correl = correl($ssxx, $ssyy, $ssxy);
   my $xcorrel = sprintf("%.4f", $correl);
   return($xcorrel);
}

sub correl {
   my ($ssxx, $ssyy, $ssxy) = @_;
   my $sign = $ssxy / abs($ssxy);
   my $correl = $sign * sqrt($ssxy * $ssxy / ($ssxx * $ssyy));
   return $correl;
}


# Hierarchical clustering with p-values

The code, which allowed me to use Spearman's rank correlation coefficient, was kindly provided to me by the developer of pvclust.

First, download the unofficial package or just source it from my Dropbox account. Start up R and follow:

#load the package
#if you are having trouble sourcing from https run
#install.packages('devtools')
#library(devtools)
#source_url("https://dl.dropboxusercontent.com/u/15251811/pvclust/pvclust.R")
#source_url("https://dl.dropboxusercontent.com/u/15251811/pvclust/pvclust-internal.R")

source("https://dl.dropboxusercontent.com/u/15251811/pvclust/pvclust.R")
source("https://dl.dropboxusercontent.com/u/15251811/pvclust/pvclust-internal.R")

#use a test dataset from DESeq
#install DESeq if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq")
library("DESeq")
example_file <- system.file("extra/TagSeqExample.tab", package="DESeq")
data <- read.delim(example_file, row.names=1) #load the counts; the first column holds the gene names
head(data)
            T1a T1b  T2  T3  N1  N2
Gene_00001    0   0   2   0   0   1
Gene_00002   20   8  12   5  19  26
Gene_00003    3   0   2   0   0   0
Gene_00004   75  84 241 149 271 257
Gene_00005   10  16   4   0   4  10
Gene_00006  129 126 451 223 243 149

# Define a distance function

spearman <- function(x, ...) {
    x <- as.matrix(x)
    res <- as.dist(1 - cor(x, method = "spearman", use = "everything"))
    attr(res, "method") <- "spearman"
    return(res)
}

result <- pvclust(data, method.dist=spearman, nboot=100)

result

Cluster method: average
Distance      : spearman

Estimates on edges:

au bp se.au se.bp v c pchi
1  1  1     0     0 0 0    0
2  1  1     0     0 0 0    0
3  1  1     0     0 0 0    0
4  1  1     0     0 0 0    0
5  1  1     0     0 0 0    0

#pvclust classed object
class(result)
[1] "pvclust"

names(result)
[1] "hclust" "edges"  "count"  "msfit"  "nboot"  "r"      "store"

plot(result)
pvrect(result, alpha=0.95)



Hierarchical clustering with p-values.