Creating a correlation matrix with R

Updated: 6th January 2014

This post on creating a correlation matrix with R was first published on the 31st of January 2012 and has become one of my most viewed posts. I've learned a bit more since then, so I have updated and improved it.

Incentive

Let A be an m × n matrix with elements a_ij, where i indexes the rows and j indexes the columns.

If the matrix A contained transcriptomic data, a_ij would be the expression level of the i-th transcript in the j-th assay. The elements of the i-th row of A form the transcriptional response of the i-th transcript, and the elements of the j-th column of A form the expression profile of the j-th assay.
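
For example, here is how the rows and columns of a matrix are accessed in R:

#toy example: 2 transcripts (rows) assayed across 3 samples (columns)
A <- matrix(1:6, nrow=2, ncol=3)
A[1,] #the transcriptional response of transcript 1
A[,2] #the expression profile of assay 2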

Transcripts that have similar transcriptional responses may be co-expressed and may belong to the same biological pathway. A simple way of looking at co-expression is through correlation, i.e. correlating all pairs of transcriptional responses, which results in a correlation matrix.

Getting started

Let's get started with a simple example: a 10 × 10 matrix of random numbers.

#create random matrix with numbers ranging from 1 to 100
set.seed(12345)
A <- matrix(runif(100,1,100),nrow=10,ncol=10,byrow=T)
image(A)

[Figure: heatmap of the random matrix A]

Now let's calculate Spearman's rank correlation coefficient between every pair of rows and store the results in a correlation matrix.

#our "expression" matrix
A
           [,1]      [,2]      [,3]      [,4]      [,5]     [,6]      [,7]      [,8]      [,9]     [,10]
 [1,] 72.369486 87.701546 76.337251 88.726332 46.191615 17.47081 33.184443 51.413209 73.042820 98.983957
 [2,]  4.419008 16.084976 73.832810  1.112522 39.729130 46.78697 39.426254 40.846029 18.717395 95.214217
 [3,] 45.919079 33.348488 96.576117 71.040706 64.809721 39.59302 70.155820 54.861729 23.420251 48.971218
 [4,] 79.507710  1.592775 19.583532 68.501529 37.640308 36.80093 87.010695 90.511312 62.125032 14.269132
 [5,] 78.437135 43.490683 92.800124 77.551079 26.708443 32.80124  6.959321  5.302189  6.450328 62.928737
 [6,] 96.482559 82.902984 32.187795 22.089520 73.517116 50.42486 73.247425  8.953268 44.117518 24.421465
 [7,] 79.365212 26.609747 98.612399 75.930501 97.998046 22.67584 94.922011 15.796337 60.435340 94.696644
 [8,] 69.146982 51.047839 37.920688 34.244699  5.776884 62.27581 96.183282 65.841090 51.518907 15.859723
 [9,] 87.174340 51.929726  1.856143  2.900282 15.306678 31.19814 82.740029 50.732120 80.553690  7.003358
[10,] 92.867559 81.009677  8.802521 60.491900 71.763208 51.87072 72.291534 75.244681 10.468412 40.384761

#row 1
A[1,]
 [1] 72.36949 87.70155 76.33725 88.72633 46.19162 17.47081 33.18444 51.41321 73.04282 98.98396
#row 2
A[2,]
 [1]  4.419008 16.084976 73.832810  1.112522 39.729130 46.786971 39.426254 40.846029 18.717395 95.214217

#calculating the Spearman's correlation between row 1 and row 2
cor(A[1,], A[2,], method="spearman")
[1] -0.1030303

#calculating all row-by-row correlations; note the transpose t(), since cor() correlates columns
correlation_matrix <- cor(t(A), method="spearman")
correlation_matrix
             [,1]        [,2]        [,3]       [,4]        [,5]       [,6]       [,7]       [,8]
 [1,]  1.00000000 -0.10303030  0.05454545 -0.4545455  0.50303030 -0.3212121  0.1151515 -0.5393939
 [2,] -0.10303030  1.00000000  0.15151515 -0.3454545 -0.05454545 -0.3575758  0.2000000 -0.2242424
 [3,]  0.05454545  0.15151515  1.00000000  0.2363636  0.33333333 -0.3696970  0.6121212 -0.2121212
 [4,] -0.45454545 -0.34545455  0.23636364  1.0000000 -0.43030303 -0.1757576 -0.1636364  0.6121212
 [5,]  0.50303030 -0.05454545  0.33333333 -0.4303030  1.00000000  0.1515152  0.4545455 -0.2969697
 [6,] -0.32121212 -0.35757576 -0.36969697 -0.1757576  0.15151515  1.0000000  0.2000000  0.2242424
 [7,]  0.11515152  0.20000000  0.61212121 -0.1636364  0.45454545  0.2000000  1.0000000 -0.3575758
 [8,] -0.53939394 -0.22424242 -0.21212121  0.6121212 -0.29696970  0.2242424 -0.3575758  1.0000000
 [9,] -0.41818182 -0.45454545 -0.56363636  0.4424242 -0.40606061  0.5878788 -0.2969697  0.7696970
[10,] -0.22424242 -0.53939394 -0.17575758  0.3818182 -0.12727273  0.4909091 -0.3090909  0.4545455
            [,9]      [,10]
 [1,] -0.4181818 -0.2242424
 [2,] -0.4545455 -0.5393939
 [3,] -0.5636364 -0.1757576
 [4,]  0.4424242  0.3818182
 [5,] -0.4060606 -0.1272727
 [6,]  0.5878788  0.4909091
 [7,] -0.2969697 -0.3090909
 [8,]  0.7696970  0.4545455
 [9,]  1.0000000  0.6242424
[10,]  0.6242424  1.0000000

#output the matrix in a file
#using rounded values
library(MASS)
write.matrix(round(correlation_matrix, digits=3), file="cor_matrix.txt")

If we look at the file cor_matrix.txt:

cat cor_matrix.txt
 1.000 -0.103  0.055 -0.455  0.503 -0.321  0.115 -0.539 -0.418 -0.224
-0.103  1.000  0.152 -0.345 -0.055 -0.358  0.200 -0.224 -0.455 -0.539
 0.055  0.152  1.000  0.236  0.333 -0.370  0.612 -0.212 -0.564 -0.176
-0.455 -0.345  0.236  1.000 -0.430 -0.176 -0.164  0.612  0.442  0.382
 0.503 -0.055  0.333 -0.430  1.000  0.152  0.455 -0.297 -0.406 -0.127
-0.321 -0.358 -0.370 -0.176  0.152  1.000  0.200  0.224  0.588  0.491
 0.115  0.200  0.612 -0.164  0.455  0.200  1.000 -0.358 -0.297 -0.309
-0.539 -0.224 -0.212  0.612 -0.297  0.224 -0.358  1.000  0.770  0.455
-0.418 -0.455 -0.564  0.442 -0.406  0.588 -0.297  0.770  1.000  0.624
-0.224 -0.539 -0.176  0.382 -0.127  0.491 -0.309  0.455  0.624  1.000

What about bigger matrices?

The computational time grows quadratically with the number of rows, since we are calculating the correlations between every pair of rows. For a matrix with 10,005 rows and 40 columns, there would be:

choose(10005,2)
[1] 50045010

50,045,010 comparisons. It took ~33 hours to do all the calculations on one core of an Intel(R) Xeon(R) CPU X7560 @ 2.27GHz.
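
The quadratic growth comes from the formula for the number of pairs, n(n-1)/2, which is exactly what choose(n, 2) computes:

#the number of pairwise comparisons for n rows is n*(n-1)/2
n <- 10005
n * (n - 1) / 2
#[1] 50045010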

Let's create a plot of the number of rows versus the number of pairwise comparisons:

#for a dataset with 10 rows
choose(10,2)
[1] 45

#for 195,433 rows
choose(195433,2)
[1] 19096931028

#store comparisons in an array
row_number <- 1000000
x <- array(0, dim=c(row_number,1))
system.time(
for (i in 1:row_number){
  x[i,1]<- choose(i,2)
}
)
#   user  system elapsed 
#   3.39    0.00    3.39
tail(x)
                  [,1]
[999995,] 499994500015
[999996,] 499995500010
[999997,] 499996500006
[999998,] 499997500003
[999999,] 499998500001
[1e+06,]  499999500000

plot(x,type="l")

[Figure: number of rows versus number of pairwise comparisons]
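
As an aside, choose() is vectorised, so the loop above can be replaced with a much faster one-liner:

#vectorised version of the loop above
x <- choose(1:row_number, 2)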

As I mentioned above, it took roughly 33 hours to compute the correlations of a 10,005 by 40 matrix (50,045,010 comparisons) on one core.

We can speed this up by using more cores.

Parallel calculation of correlations

This fantastic guide on Parallel Computing in R shows how we can parallelise the calculations, and I will follow its example. It uses the test dataset included in the package Biobase:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("Biobase")

#load the data
data(geneData, package = "Biobase")
dim(geneData)
#[1] 500  26

#how many comparisons?
choose(500,2)
[1] 124750

The comparisons go like this:

row1 by row2
row1 by row3
...

We can use the combn() function to generate the comparisons as a list:

#for 3 rows
combn(3, 2, simplify=F)
#[[1]]
#[1] 1 2

#[[2]]
#[1] 1 3

#[[3]]
#[1] 2 3

#for all rows of geneData
pair <- combn(1:nrow(geneData), 2, simplify = F)

#now we need a function that calculates the correlation between
#two rows, taking an element of the object pair as input
geneCor <- function(x, gene = geneData) {
   cor(gene[x[1], ], gene[x[2], ])
}

#to find the correlation of row 1 and 5
geneCor(c(1,5))
#[1] 0.6572192

#all correlations, stored in out
system.time(out <- lapply(pair, geneCor))
#   user  system elapsed
#  9.153   0.012   9.181

#results
head(out, 3)
[[1]]
[1] 0.1536158

[[2]]
[1] 0.7066034

[[3]]
[1] -0.2125437

#we can parallelise this by using
#the package multicore
#install if necessary
install.packages("multicore")

#load library
library(multicore)

#use the mclapply() function
system.time(out <- mclapply(pair, geneCor))
#   user  system elapsed
# 11.083   0.561   1.206

corm <- cbind(do.call(rbind, pair), unlist(out))
write.table(corm, file = "correlation.tsv", quote = F, sep = "\t")

Using mclapply() took ~1 second compared to ~9 seconds with lapply().
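
Note that in newer versions of R, the multicore package has been superseded by the parallel package, which ships with R and provides the same mclapply() function. A minimal sketch:

#the parallel package is included with R (>= 2.14.0)
library(parallel)
#mc.cores sets the number of worker processes
system.time(out <- mclapply(pair, geneCor, mc.cores=detectCores()))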

Creating a co-expression network

In the R code above, I saved the correlation matrix using the function write.matrix(). Below is a simple Perl script that parses the output of write.matrix() and creates a sif file, which can then be loaded into Cytoscape. Since the correlation matrix is symmetric, the script only examines the upper triangle. Please note that this script does not capture negative correlations, only positive correlations that are greater than or equal to the $threshold value.

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.txt> <threshold>\n";
my $infile = shift or die $usage;
my $threshold = shift or die $usage;

my @name = ();
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   #the first line of the write.matrix() output contains the names
   if ($. == 1){
      @name = split(/\s+/);
      next;
   }
   #zero-based index of the current row
   my $current = $. - 2;
   my @cor = split();
   #only check the upper triangle, since the matrix is symmetric
   for (my $i=$current+1; $i<scalar(@cor); ++$i){
      if ($cor[$i] >= $threshold){
         #node <tab> interaction type <tab> node
         print join("\t",$name[$current],'xx',$name[$i]),"\n";
      }
   }
}
close(IN);

exit(0);

Basically, the script examines the correlation matrix and prints out an association whenever the correlation between two rows is greater than or equal to $threshold. For example, if the script is saved as cor2sif.pl (a name I've made up here), it can be run like this:
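
./cor2sif.pl correlation_matrix.txt 0.5 > network.sif

Here's how it looks when loaded into Cytoscape: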

While most transcripts have a very similar expression pattern, several transcripts exhibit unique transcriptional responses.

Correlation network within R

An example of creating a correlation network was also available from the guide on Parallel Computing in R. Following on from above:

data(geneData, package = "Biobase")
pair <- combn(1:nrow(geneData), 2, simplify = F)
geneCor <- function(x, gene = geneData) {
   cor(gene[x[1], ], gene[x[2], ])
}
out <- lapply(pair, geneCor)

head(out, n=3)
#[[1]]
#[1] 0.1536158
#
#[[2]]
#[1] 0.7066034
#
#[[3]]
#[1] -0.2125437

#reshape the result
corm <- cbind(do.call(rbind, pair), unlist(out))
head(corm,n=3)
     [,1] [,2]       [,3]
[1,]    1    2  0.1536158
[2,]    1    3  0.7066034
[3,]    1    4 -0.2125437

#remove low cor pairs
corm <- corm[abs(corm[,3]) >= 0.86, ]
dim(corm)
#[1] 112   3

#install packages if necessary
install.packages("network")
install.packages("sna")

library(network)
library(sna)

#the network
net <- network(corm, directed = F)
net
# Network attributes:
#  vertices = 498 
#  directed = FALSE 
#  hyper = FALSE 
#  loops = FALSE 
#  multiple = FALSE 
#  bipartite = FALSE 
#  total edges= 112 
#    missing edges= 0 
#    non-missing edges= 112 
#
# Vertex attribute names: 
#    vertex.names 
#
#No edge attributes

#component analysis
cd <- component.dist(net)

#delete genes not connected with others
delete.vertices(net, which(cd$csize[cd$membership] == 1))

plot(net)

[Figure: the resulting co-expression network]

Using a real dataset

I will use the pnas_expression.txt file, which is a processed dataset from this study.

data <- read.table("pnas_expression.txt", header=T, row.names=1)
dim(data)
[1] 37435     8
head(data)
                lane1 lane2 lane3 lane4 lane5 lane6 lane8  len
ENSG00000215696     0     0     0     0     0     0     0  330
ENSG00000215700     0     0     0     0     0     0     0 2370
ENSG00000215699     0     0     0     0     0     0     0 1842
ENSG00000215784     0     0     0     0     0     0     0 2393
ENSG00000212914     0     0     0     0     0     0     0  384
ENSG00000212042     0     0     0     0     0     0     0   92
#get rid of the len column
data$len <- NULL
#check how many rows are all zeros
table(rowSums(data)==0)

FALSE  TRUE 
21877 15558
#create a smaller subset as an example
data_subset <- data[rowSums(data)>500,]
dim(data_subset)
[1] 4010    7

#save row names
my_row <- rownames(data_subset)

#convert into matrix
data_subset_matrix <- as.matrix(data_subset)

#correlation of row 1 and row 2
cor(data_subset_matrix[1,],data_subset_matrix[2,],method="spearman")
[1] 0.5357143

#correlation matrix
correlation_matrix <- cor(t(data_subset_matrix), method="spearman")
correlation_matrix[1,2]
[1] 0.5357143
dim(correlation_matrix)
[1] 4010 4010

#output results
library(MASS)
write.matrix(correlation_matrix, file="pnas_expression_correlation.tsv")
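
As with the earlier example, you may want to round the values to keep the file size down:

#optionally save rounded values instead
write.matrix(round(correlation_matrix, digits=3), file="pnas_expression_correlation.tsv")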

The Perl script above can be used to convert the pnas_expression_correlation.tsv file into a sif file, which can then be loaded into Cytoscape.

Conclusions

Creating a correlation matrix with R is quite easy and, as I have shown, the results can be visualised using Cytoscape. When applied to transcriptomic datasets, this may be useful for identifying co-expressed transcripts. I've shown an example of this using a real dataset; note, however, that the example has relatively few assays, which may limit the usefulness of this approach since each transcriptional response contains only a handful of measurements. However, imagine a dataset with a large number of assays, such as transcriptional profiling of a large panel of tissues. Using this approach, we may be able to unveil tissue-specific transcripts, since they will have very distinctive transcriptional responses.

Finally, one can use the multicore package to speed up a large number of calculations.

DESeq vs. edgeR vs. baySeq

6th April 2012: For a more updated version of this post, please see this post.

A very simple comparison

I used the TagSeqExample.tab file from the DESeq package as the benchmark dataset. According to the DESeq authors, T1a and T1b are similar, so I removed the second column of the file, which corresponds to T1a:

cut --complement -f2 TagSeqExample.tab > tag_seq_example_less_t1a.tsv

Hierarchical clustering

To get an idea of how similar the libraries are, I performed hierarchical clustering using the Spearman correlation of the libraries (see here). Note that the raw counts have not been adjusted for sequencing depth, but a rank-based correlation should be reasonably robust to differences in library size.
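
A minimal sketch of this clustering step (assuming the counts are read into a data frame called data, my own naming):

data <- read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, row.names=1)
#distance between libraries: 1 - Spearman correlation of the columns
d <- as.dist(1 - cor(data, method="spearman"))
plot(hclust(d))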

From the dendrogram, we observe a marked difference between the normal samples and the tumour samples.

Installing the R packages

I then installed all the required packages (R version 2.12.0, biocinstall version 2.7.4):

source("http://www.bioconductor.org/biocLite.R")
biocLite("baySeq")
biocLite("DESeq")
biocLite("edgeR")

DESeq

DESeq code adapted from the vignette:

#load the package
library(DESeq)

countsTable <- read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, stringsAsFactors=TRUE)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[,-1]
head(countsTable)
conds <- c("T","T","T","N","N")
cds <- newCountDataSet(countsTable, conds)
cds <- estimateSizeFactors(cds)
sizeFactors(cds)
      T1b        T2        T3        N1        N2 
0.5587394 1.5823096 1.1270425 1.2869337 0.8746998
cds <- estimateVarianceFunctions(cds)
res <- nbinomTest(cds, "T", "N")
resSig <- res[res$padj < 0.1, ]
nrow(resSig)
[1] 1072
write.csv(resSig, file="deseq_res_sig.csv")

edgeR

First, transform the tag_seq_example_less_t1a.tsv file into the edgeR format using this script.

tsv_to_edger.pl tag_seq_example_less_t1a.tsv
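
For reference, targets.txt describes one library per row; based on the d$samples output below, it should look something like this (a reconstruction, not the actual file):

files	group	description
T1b	patient	patient
T2	patient	patient
T3	patient	patient
N1	control	control
N2	control	control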

library(edgeR)
targets = read.delim(file = "targets.txt", stringsAsFactors = FALSE)
d = readDGE(targets,comment.char="#")
d$samples
  files   group description lib.size norm.factors
1   T1b patient     patient  2399311            1
2    T2 patient     patient  7202770            1
3    T3 patient     patient  5856461            1
4    N1 control     control  6376307            1
5    N2 control     control  3931277            1

edgeR recently added a normalisation function called calcNormFactors() to the package. For more information, see this paper.

d = calcNormFactors(d)
d$samples
  files   group description lib.size norm.factors
1   T1b patient     patient  2399311    1.1167157
2    T2 patient     patient  7202770    0.9978323
3    T3 patient     patient  5856461    0.9170402
4    N1 control     control  6376307    0.9546010
5    N2 control     control  3931277    1.0251550

This normalisation approach (TMM, the trimmed mean of M-values) is different from the one taken by DESeq, which derives size factors from the median ratio of each library to a pseudo-reference.

d = estimateCommonDisp(d)
de.com = exactTest(d)
options(digits=4)
topTags(de.com)
detags.com = rownames(topTags(de.com)$table)

sum(de.com$table$p.value < 0.01)
[1] 895

sum(p.adjust(de.com$table$p.value,method="BH") < 0.1)
[1] 593

good = sum(p.adjust(de.com$table$p.value,method="BH") < 0.1)
goodList = topTags(de.com, n=good)
sink("edger.txt")
goodList
sink()

baySeq

library(baySeq)
all = read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, sep="\t")
#library sizes, taken from the edgeR output above
lib <- c(2399311,7202770,5856461,6376307,3931277)
replicates <- c(1,1,1,2,2)
groups <- list(NDE = c(1,1,1,1,1), DE = c(1,1,1,2,2) )
cname <- all[,1]
all <- all[,-1]
all = as.matrix(all)
CD <- new("countData", data = all, replicates = replicates, libsizes = as.integer(lib), groups = groups)
CD@annotation <- as.data.frame(cname)
cl <- NULL
CDP.NBML <- getPriors.NB(CD, samplesize = 1000, estimation = "QL", cl = cl)
CDPost.NBML <- getLikelihoods.NB(CDP.NBML, pET = 'BIC', cl = cl)
CDPost.NBML@estProps
[1] 0.8897621 0.1102379
topCounts(CDPost.NBML, group=2)
NBML.TPs <- getTPs(CDPost.NBML, group=2, TPs = 1:100)
bayseq_de = topCounts(CDPost.NBML, group=2, number=2000)
write.csv(bayseq_de, file="bayseq_de.csv")

Overlap

R package   Number of differentially expressed genes
DESeq       888
edgeR       895
baySeq      1115
Total       18,760

The overlaps between the packages (see this post on how to make a Venn diagram):

DESeq with edgeR               591
DESeq with baySeq              488
edgeR with baySeq              465
DESeq with edgeR with baySeq   338
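
As a minimal sketch, assuming deseq_genes, edger_genes and bayseq_genes are character vectors holding the differentially expressed gene identifiers from each package (names I've made up here), the overlaps can be computed with intersect():

#pairwise overlaps
length(intersect(deseq_genes, edger_genes))
length(intersect(deseq_genes, bayseq_genes))
length(intersect(edger_genes, bayseq_genes))

#genes found by all three packages
length(Reduce(intersect, list(deseq_genes, edger_genes, bayseq_genes)))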

This is a first draft post and will be continually updated in the future, but I just thought I'd publish it first.

Hierarchical clustering with p-values

The code that allowed me to use Spearman's rank correlation coefficient was kindly provided by the developer of pvclust.

First, download the unofficial package or simply source it from my Dropbox account. Start up R and follow along:

#load the package
#if you are having trouble sourcing from https run
#install.packages('devtools')
#library(devtools)
#source_url("https://dl.dropboxusercontent.com/u/15251811/pvclust/pvclust.R")
#source_url("https://dl.dropboxusercontent.com/u/15251811/pvclust/pvclust-internal.R")

source("https://dl.dropboxusercontent.com/u/15251811/pvclust/pvclust.R")
source("https://dl.dropboxusercontent.com/u/15251811/pvclust/pvclust-internal.R")

#use a test dataset from DESeq
#install DESeq if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq")
#load DESeq
library("DESeq")
example_file <- system.file("extra/TagSeqExample.tab", package="DESeq")
data <- read.delim(example_file, header=T, row.names="gene")
head(data)
           T1a T1b  T2  T3  N1  N2
Gene_00001   0   0   2   0   0   1
Gene_00002  20   8  12   5  19  26
Gene_00003   3   0   2   0   0   0
Gene_00004  75  84 241 149 271 257
Gene_00005  10  16   4   0   4  10
Gene_00006 129 126 451 223 243 149

# Define a distance function

spearman <- function(x, ...) {
    x <- as.matrix(x)
    #distance = 1 - Spearman correlation of the columns
    res <- as.dist(1 - cor(x, method = "spearman", use = "everything"))
    attr(res, "method") <- "spearman"
    return(res)
}

result <- pvclust(data, method.dist=spearman, nboot=100)

result

Cluster method: average
Distance      : spearman

Estimates on edges:

  au bp se.au se.bp v c pchi
1  1  1     0     0 0 0    0
2  1  1     0     0 0 0    0
3  1  1     0     0 0 0    0
4  1  1     0     0 0 0    0
5  1  1     0     0 0 0    0

#pvclust classed object
class(result)
[1] "pvclust"

names(result)
[1] "hclust" "edges"  "count"  "msfit"  "nboot"  "r"      "store"

plot(result)
pvrect(result, alpha=0.95)
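
pvclust also provides pvpick(), which extracts the clusters that are strongly supported by the data:

#extract clusters with an AU p-value >= 0.95
pvpick(result, alpha=0.95)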

[Figure: pvclust example of hierarchical clustering with p-values]