Updated 2014 January 6th

This post on creating a correlation matrix with R was published in 2012 on January the 31st and has become one of the most viewed posts. I've learned a bit more since then, so I have updated and improved this post.

### Incentive

Let be a matrix, where are elements of , where is the row and is the column.

If the matrix contained transcriptomic data, is the expression level of the transcript in the assay. The elements of the row of form the transcriptional response of the transcript. The elements of the column of form the expression profile of the assay.

Transcripts that have a similar transcriptional response may indicate that they are co-expressed together, and may belong to the same biological pathway. A simple way of looking at co-expression is through correlation, i.e., correlating all pairs of transcriptional responses, resulting in a correlation matrix.

### Getting started

Let's get started with a simple example, with a matrix.

#create random matrix with numbers ranging from 1 to 100 set.seed(12345) A <- matrix(runif(100,1,100),nrow=10,ncol=10,byrow=T) image(A)

Now to calculate the correlations, using Spearman's rank correlation coefficient, of each row by each row using R and storing the results in a correlation matrix.

#our "expression" matrix A [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 72.369486 87.701546 76.337251 88.726332 46.191615 17.47081 33.184443 51.413209 73.042820 98.983957 [2,] 4.419008 16.084976 73.832810 1.112522 39.729130 46.78697 39.426254 40.846029 18.717395 95.214217 [3,] 45.919079 33.348488 96.576117 71.040706 64.809721 39.59302 70.155820 54.861729 23.420251 48.971218 [4,] 79.507710 1.592775 19.583532 68.501529 37.640308 36.80093 87.010695 90.511312 62.125032 14.269132 [5,] 78.437135 43.490683 92.800124 77.551079 26.708443 32.80124 6.959321 5.302189 6.450328 62.928737 [6,] 96.482559 82.902984 32.187795 22.089520 73.517116 50.42486 73.247425 8.953268 44.117518 24.421465 [7,] 79.365212 26.609747 98.612399 75.930501 97.998046 22.67584 94.922011 15.796337 60.435340 94.696644 [8,] 69.146982 51.047839 37.920688 34.244699 5.776884 62.27581 96.183282 65.841090 51.518907 15.859723 [9,] 87.174340 51.929726 1.856143 2.900282 15.306678 31.19814 82.740029 50.732120 80.553690 7.003358 [10,] 92.867559 81.009677 8.802521 60.491900 71.763208 51.87072 72.291534 75.244681 10.468412 40.384761 #row 1 A[1,] [1] 72.36949 87.70155 76.33725 88.72633 46.19162 17.47081 33.18444 51.41321 73.04282 98.98396 #row 2 A[2,] [1] 4.419008 16.084976 73.832810 1.112522 39.729130 46.786971 39.426254 40.846029 18.717395 95.214217 #calculating the Spearman's correlation between row 1 and row 2 cor(A[1,], A[2,], method="spearman") [1] -0.1030303 #calculating all row by row correlations, remember the transpose t() correlation_matrix <- cor(t(A), method="spearman") correlation_matrix [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 1.00000000 -0.10303030 0.05454545 -0.4545455 0.50303030 -0.3212121 0.1151515 -0.5393939 [2,] -0.10303030 1.00000000 0.15151515 -0.3454545 -0.05454545 -0.3575758 0.2000000 -0.2242424 [3,] 0.05454545 0.15151515 1.00000000 0.2363636 0.33333333 -0.3696970 0.6121212 -0.2121212 [4,] -0.45454545 -0.34545455 0.23636364 1.0000000 -0.43030303 -0.1757576 -0.1636364 0.6121212 [5,] 0.50303030 -0.05454545 0.33333333 -0.4303030 1.00000000 0.1515152 0.4545455 -0.2969697 [6,] -0.32121212 -0.35757576 -0.36969697 -0.1757576 0.15151515 1.0000000 0.2000000 0.2242424 [7,] 0.11515152 0.20000000 0.61212121 -0.1636364 0.45454545 0.2000000 1.0000000 -0.3575758 [8,] -0.53939394 -0.22424242 -0.21212121 0.6121212 -0.29696970 0.2242424 -0.3575758 1.0000000 [9,] -0.41818182 -0.45454545 -0.56363636 0.4424242 -0.40606061 0.5878788 -0.2969697 0.7696970 [10,] -0.22424242 -0.53939394 -0.17575758 0.3818182 -0.12727273 0.4909091 -0.3090909 0.4545455 [,9] [,10] [1,] -0.4181818 -0.2242424 [2,] -0.4545455 -0.5393939 [3,] -0.5636364 -0.1757576 [4,] 0.4424242 0.3818182 [5,] -0.4060606 -0.1272727 [6,] 0.5878788 0.4909091 [7,] -0.2969697 -0.3090909 [8,] 0.7696970 0.4545455 [9,] 1.0000000 0.6242424 [10,] 0.6242424 1.0000000 #output the matrix in a file #using rounded values library(MASS) write.matrix(round(correlation_matrix, digits=3), file="cor_matrix.txt")

If we look at the file cor_matrix.txt:

cat cor_matrix.txt 1.000 -0.103 0.055 -0.455 0.503 -0.321 0.115 -0.539 -0.418 -0.224 -0.103 1.000 0.152 -0.345 -0.055 -0.358 0.200 -0.224 -0.455 -0.539 0.055 0.152 1.000 0.236 0.333 -0.370 0.612 -0.212 -0.564 -0.176 -0.455 -0.345 0.236 1.000 -0.430 -0.176 -0.164 0.612 0.442 0.382 0.503 -0.055 0.333 -0.430 1.000 0.152 0.455 -0.297 -0.406 -0.127 -0.321 -0.358 -0.370 -0.176 0.152 1.000 0.200 0.224 0.588 0.491 0.115 0.200 0.612 -0.164 0.455 0.200 1.000 -0.358 -0.297 -0.309 -0.539 -0.224 -0.212 0.612 -0.297 0.224 -0.358 1.000 0.770 0.455 -0.418 -0.455 -0.564 0.442 -0.406 0.588 -0.297 0.770 1.000 0.624 -0.224 -0.539 -0.176 0.382 -0.127 0.491 -0.309 0.455 0.624 1.000

### What about bigger matrices?

The computational time increases exponentially as we have more rows, since we are calculating the correlations between every row pair. For a matrix with 10005 rows and 40 columns, there would be:

choose(10005,2) [1] 50045010

50,045,010 comparisons. It took ~33 hours to do all the calculations on one core of a Intel(R) Xeon(R) CPU X7560 @ 2.27GHz.

Let's create a plot of the number of rows versus the number of pairwise comparisons:

#for a dataset with 10 rows choose(10,2) [1] 45 #for 195,433 choose(195433,2) [1] 19096931028 #store comparisons in an array row_number <- 1000000 x <- array(0, dim=c(row_number,1)) system.time( for (i in 1:row_number){ x[i,1]<- choose(i,2) } ) # user system elapsed # 3.39 0.00 3.39 tail(x) [,1] [999995,] 499994500015 [999996,] 499995500010 [999997,] 499996500006 [999998,] 499997500003 [999999,] 499998500001 [1e+06,] 499999500000 plot(x,type="l")

As I mentioned above, it took roughly 33 hours to compute the correlations of a 10,005 by 40 matrix (50,045,010 comparisons) on one core.

We can speed this up by using more cores.

### Parallel calculation of correlations

This fantastic guide on Parallel Computing in R shows how we can parallelise the calculations and I will follow their example. They used the test dataset included in the package Biobase:

#install if necessary source("http://bioconductor.org/biocLite.R") biocLite("Biobase") #load the data data(geneData, package = "Biobase") dim(geneData) #[1] 500 26 #how many comparisons? choose(500,2) [1] 124750

The comparisons go like this:

row1 by row2

row1 by row3

...

We can use the combn() function to generate the comparisons as a list:

#for 3 rows combn(3, 2, simplify=F) #[[1]] #[1] 1 2 #[[2]] #[1] 1 3 #[[3]] #[1] 2 3 #for all rows of geneData pair <- combn(1:nrow(geneData), 2, simplify = F) #now we need to create a function to calculate the correlations #taking the input of the object pair geneCor <- function(x, gene = geneData) { cor(gene[x[1], ], gene[x[2], ]) } #to find the correlation of row 1 and 5 geneCor(c(1,5)) #[1] 0.6572192 #all correlations, stored in out system.time(out <- lapply(pair, geneCor)) # user system elapsed # 9.153 0.012 9.181 #results head(out, 3) [[1]] [1] 0.1536158 [[2]] [1] 0.7066034 [[3]] [1] -0.2125437 #we can parallelise this by using #the package multicore #install if necessary install.packages("multicore") #load library library(multicore) #use the mclapply() function system.time(out <- mclapply(pair, geneCor)) # user system elapsed # 11.083 0.561 1.206 corm <- cbind(do.call(rbind, pair), unlist(out)) write.table(corm, file = "correlation.tsv", quote = F, sep = "\t")

Using mclapply took ~1 sec compared to ~9.

### Creating a co-expression network

In the R code above, I saved the correlation matrix using the function write.matrix(). Below is a simple Perl script that parses the output from write.matrix() and creates a sif file, which can then be loaded into Cytoscape. Please note that this script does not capture negative correlations, only positive correlations that are equal to or higher than the $threshold value.

#!/usr/bin/perl use strict; use warnings; my $usage = "Usage: $0 <infile.txt> <correlation>\n"; my $infile = shift or die $usage; my $cor = shift or die $usage; my @name = (); my %sif = (); my $counter = 0; open(IN,'<',$infile) || die "Could not open $infile: $!\n"; while(<IN>){ chomp; if ($. == 1){ @name = split(/\s+/); next; } my $current = $. - 2; my $counter = $counter + $. - 1; my @cor = split(); print "$counter $current\n"; for (my $i=$counter; $i<scalar(@cor); ++$i){ if ($cor[$i] >= $cor){ #print join("\t",$name[$current],$name[$i],$current,$i,$cor[$i]),"\n"; print join("\t",$name[$current],'xx',$name[$i]),"\n"; } } } close(IN); exit(0);

Basically the script examines the correlation matrix and prints out associations if the correlation between two rows is greater than $threshold. Here's how it looks when loaded into Cytoscape:

*While most transcripts have a very similar expression pattern, several transcripts exhibit unique transcriptional responses.*

### Correlation network within R

An example of creating a correlation network was also available from the guide on Parallel Computing in R. Following on from above:

data(geneData, package = "Biobase") pair <- combn(1:nrow(geneData), 2, simplify = F) geneCor <- function(x, gene = geneData) { cor(gene[x[1], ], gene[x[2], ]) } out <- lapply(pair, geneCor) head(out, n=3) #[[1]] #[1] 0.1536158 # #[[2]] #[1] 0.7066034 # #[[3]] #[1] -0.2125437 #reshape the result corm <- cbind(do.call(rbind, pair), unlist(out)) head(corm,n=3) [,1] [,2] [,3] [1,] 1 2 0.1536158 [2,] 1 3 0.7066034 [3,] 1 4 -0.2125437 #remove low cor pairs corm <- corm[abs(corm[,3]) >= 0.86, ] dim(corm) #[1] 112 3 #install packages if necessary install.packages("network") install.packages("sna") library(network) library(sna) #the network net <- network(corm, directed = F) net # Network attributes: # vertices = 498 # directed = FALSE # hyper = FALSE # loops = FALSE # multiple = FALSE # bipartite = FALSE # total edges= 112 # missing edges= 0 # non-missing edges= 112 # # Vertex attribute names: # vertex.names # #No edge attributes #component analysis cd <- component.dist(net) #delete genes not connected with others delete.vertices(net, which(cd$csize[cd$membership] == 1)) plot(net)

### Using a real dataset

I will use the pnas_expression.txt file, which is a processed dataset from this this study.

data <-read.table("pnas_expression.txt",header=T,row.names=1) dim(data) [1] 37435 8 head(data) lane1 lane2 lane3 lane4 lane5 lane6 lane8 len ENSG00000215696 0 0 0 0 0 0 0 330 ENSG00000215700 0 0 0 0 0 0 0 2370 ENSG00000215699 0 0 0 0 0 0 0 1842 ENSG00000215784 0 0 0 0 0 0 0 2393 ENSG00000212914 0 0 0 0 0 0 0 384 ENSG00000212042 0 0 0 0 0 0 0 92 #get rid of the len column data <- data[,-data$len] #check how many rows are all zeros table(rowSums(data)==0) FALSE TRUE 21877 15558 #create a smaller subset as an example data_subset <- data[rowSums(data)>500,] dim(data_subset) [1] 4010 7 #save row names my_row <- rownames(data_subset) #convert into matrix data_subset_matrix <- as.matrix(data_subset) #correlation of row 1 and row 2 cor(data_subset_matrix[1,],data_subset_matrix[2,],method="spearman") [1] 0.5357143 #correlation matrix correlation_matrix <- cor(t(data_subset_matrix), method="spearman") correlation_matrix[1,2] [1] 0.5357143 dim(correlation_matrix) [1] 4010 4010 #output results library(MASS) write.matrix(correlation_matrix, file="pnas_expression_correlation.tsv")

The Perl script above can be used to convert the pnas_expression_correlation.tsv file into a sif file, which can then be loaded into Cytoscape.

### Conclusions

Creating a correlation matrix with R is quite easy and as I have shown, the results can be visualised using Cytoscape. When applied to transcriptomic datasets, this may be useful in identifying co-expressed transcripts. I've shown an example of this using a real dataset, however note that in the example there are relatively few assays or samples. This may limit the usefulness of this approach since the number of transcriptional responses are smaller. However, imagine a dataset with a large number of assays, such as transcriptional profiling of a large panel of tissues. Using this approach, we may be able to unveil tissue specific transcripts since they will have very unique transcriptional responses.

Finally one can use the multicore package to speed up a large number of calculations.