Creating a correlation matrix with R

Updated 2024 April 7th

Incentive

Let A be a m \times n matrix, where a_{ij} are elements of A, where i is the i_{th} row and j is the j_{th} column.


A =  \begin{bmatrix} a_{11} & \cdots & a_{1j} & \cdots & a_{1n} \\\\ \vdots & \ddots & \vdots && \vdots \\\\ a_{i1} & \cdots & a_{ij} & \cdots & a_{in} \\\\ \vdots && \vdots & \ddots & \vdots \\\\ a_{m1} & \cdots & a_{mj} & \cdots & a_{mn} \end{bmatrix}

If the matrix A contained transcript expression data, then a_{ij} is the expression level of the i^{th} transcript in the j^{th} assay. The elements of the i^{th} row of A form the transcriptional response of the i^{th} transcript. The elements of the j^{th} column of A form the expression profile of the j^{th} assay.

Transcripts that have a similar transcriptional response may indicate that they are co-expressed together and could have related biological functions. A simple way of looking at co-expression is through correlation, i.e., correlating all pairs of transcriptional responses, which results in a correlation matrix.

Getting started

Let's get started with a simple example, with a $$10 \times 10$$ matrix.

create random matrix with numbers ranging from 1 to 100

set.seed(12345)
A <- matrix(runif(100,1,100),nrow=10,ncol=10,byrow=T)
image(A)

random_matrixHeatmap of $$A$$

Now to calculate the correlations, using Spearman's rank correlation coefficient, of each row by each row using R and storing the results in a correlation matrix.

our "expression" matrix

A
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 72.369486 87.701546 76.337251 88.726332 46.191615 17.47081 33.184443 51.413209 73.042820 98.983957
[2,] 4.419008 16.084976 73.832810 1.112522 39.729130 46.78697 39.426254 40.846029 18.717395 95.214217
[3,] 45.919079 33.348488 96.576117 71.040706 64.809721 39.59302 70.155820 54.861729 23.420251 48.971218
[4,] 79.507710 1.592775 19.583532 68.501529 37.640308 36.80093 87.010695 90.511312 62.125032 14.269132
[5,] 78.437135 43.490683 92.800124 77.551079 26.708443 32.80124 6.959321 5.302189 6.450328 62.928737
[6,] 96.482559 82.902984 32.187795 22.089520 73.517116 50.42486 73.247425 8.953268 44.117518 24.421465
[7,] 79.365212 26.609747 98.612399 75.930501 97.998046 22.67584 94.922011 15.796337 60.435340 94.696644
[8,] 69.146982 51.047839 37.920688 34.244699 5.776884 62.27581 96.183282 65.841090 51.518907 15.859723
[9,] 87.174340 51.929726 1.856143 2.900282 15.306678 31.19814 82.740029 50.732120 80.553690 7.003358
[10,] 92.867559 81.009677 8.802521 60.491900 71.763208 51.87072 72.291534 75.244681 10.468412 40.384761

row 1

A[1,]
[1] 72.36949 87.70155 76.33725 88.72633 46.19162 17.47081 33.18444 51.41321 73.04282 98.98396

row 2

A[2,]
[1] 4.419008 16.084976 73.832810 1.112522 39.729130 46.786971 39.426254 40.846029 18.717395 95.214217

calculating the Spearman's correlation between row 1 and row 2

cor(A[1,], A[2,], method="spearman")
[1] -0.1030303

calculating all row by row correlations, remember the transpose t()

correlation_matrix <- cor(t(A), method="spearman")
correlation_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1.00000000 -0.10303030 0.05454545 -0.4545455 0.50303030 -0.3212121 0.1151515 -0.5393939
[2,] -0.10303030 1.00000000 0.15151515 -0.3454545 -0.05454545 -0.3575758 0.2000000 -0.2242424
[3,] 0.05454545 0.15151515 1.00000000 0.2363636 0.33333333 -0.3696970 0.6121212 -0.2121212
[4,] -0.45454545 -0.34545455 0.23636364 1.0000000 -0.43030303 -0.1757576 -0.1636364 0.6121212
[5,] 0.50303030 -0.05454545 0.33333333 -0.4303030 1.00000000 0.1515152 0.4545455 -0.2969697
[6,] -0.32121212 -0.35757576 -0.36969697 -0.1757576 0.15151515 1.0000000 0.2000000 0.2242424
[7,] 0.11515152 0.20000000 0.61212121 -0.1636364 0.45454545 0.2000000 1.0000000 -0.3575758
[8,] -0.53939394 -0.22424242 -0.21212121 0.6121212 -0.29696970 0.2242424 -0.3575758 1.0000000
[9,] -0.41818182 -0.45454545 -0.56363636 0.4424242 -0.40606061 0.5878788 -0.2969697 0.7696970
[10,] -0.22424242 -0.53939394 -0.17575758 0.3818182 -0.12727273 0.4909091 -0.3090909 0.4545455
[,9] [,10]
[1,] -0.4181818 -0.2242424
[2,] -0.4545455 -0.5393939
[3,] -0.5636364 -0.1757576
[4,] 0.4424242 0.3818182
[5,] -0.4060606 -0.1272727
[6,] 0.5878788 0.4909091
[7,] -0.2969697 -0.3090909
[8,] 0.7696970 0.4545455
[9,] 1.0000000 0.6242424
[10,] 0.6242424 1.0000000

output the matrix in a file

using rounded values

library(MASS)
write.matrix(round(correlation_matrix, digits=3), file="cor_matrix.txt")

If we look at the file cor_matrix.txt:

cat cor_matrix.txt
1.000 -0.103 0.055 -0.455 0.503 -0.321 0.115 -0.539 -0.418 -0.224
-0.103 1.000 0.152 -0.345 -0.055 -0.358 0.200 -0.224 -0.455 -0.539
0.055 0.152 1.000 0.236 0.333 -0.370 0.612 -0.212 -0.564 -0.176
-0.455 -0.345 0.236 1.000 -0.430 -0.176 -0.164 0.612 0.442 0.382
0.503 -0.055 0.333 -0.430 1.000 0.152 0.455 -0.297 -0.406 -0.127
-0.321 -0.358 -0.370 -0.176 0.152 1.000 0.200 0.224 0.588 0.491
0.115 0.200 0.612 -0.164 0.455 0.200 1.000 -0.358 -0.297 -0.309
-0.539 -0.224 -0.212 0.612 -0.297 0.224 -0.358 1.000 0.770 0.455
-0.418 -0.455 -0.564 0.442 -0.406 0.588 -0.297 0.770 1.000 0.624
-0.224 -0.539 -0.176 0.382 -0.127 0.491 -0.309 0.455 0.624 1.000

What about bigger matrices?

The computational time increases exponentially as we have more rows, since we are calculating the correlations between every row pair. For a matrix with 10005 rows and 40 columns, there would be:

choose(10005,2)
[1] 50045010

50,045,010 comparisons. It took ~33 hours to do all the calculations on one core of a Intel(R) Xeon(R) CPU X7560 @ 2.27GHz.

Let's create a plot of the number of rows versus the number of pairwise comparisons:

for a dataset with 10 rows

choose(10,2)
[1] 45

for 195,433

choose(195433,2)
[1] 19096931028

store comparisons in an array

row_number <- 1000000
x <- array(0, dim=c(row_number,1))
system.time(
for (i in 1:row_number){
x[i,1]<- choose(i,2)
}
)

user system elapsed

3.39 0.00 3.39

tail(x)
[,1]
[999995,] 499994500015
[999996,] 499995500010
[999997,] 499996500006
[999998,] 499997500003
[999999,] 499998500001
[1e+06,] 499999500000

plot(x,type="l")

number_of_comparison

As I mentioned above, it took roughly 33 hours to compute the correlations of a 10,005 by 40 matrix (50,045,010 comparisons) on one core.

We can speed this up by using more cores.

Parallel calculation of correlations

This fantastic guide on Parallel Computing in R shows how we can parallelise the calculations and I will follow their example. They used the test dataset included in the package Biobase:

install if necessary

source("http://bioconductor.org/biocLite.R"😉
biocLite("Biobase")

load the data

data(geneData, package = "Biobase")
dim(geneData)

[1] 500 26

how many comparisons?

choose(500,2)
[1] 124750

The comparisons go like this:

row1 by row2
row1 by row3
...

We can use the combn() function to generate the comparisons as a list:

for 3 rows

combn(3, 2, simplify=F)

[[1]]

[1] 1 2

[[2]]

[1] 1 3

[[3]]

[1] 2 3

for all rows of geneData

pair <- combn(1:nrow(geneData), 2, simplify = F)

now we need to create a function to calculate the correlations

taking the input of the object pair

geneCor <- function(x, gene = geneData) {
cor(gene[x[1], ], gene[x[2], ])
}

to find the correlation of row 1 and 5

geneCor(c(1,5))

[1] 0.6572192

all correlations, stored in out

system.time(out <- lapply(pair, geneCor))

user system elapsed

9.153 0.012 9.181

results

head(out, 3)
[[1]]
[1] 0.1536158

[[2]]
[1] 0.7066034

[[3]]
[1] -0.2125437

we can parallelise this by using

the package multicore

install if necessary

install.packages("multicore")

load library

library(multicore)

use the mclapply() function

system.time(out <- mclapply(pair, geneCor))

user system elapsed

11.083 0.561 1.206

corm <- cbind(do.call(rbind, pair), unlist(out))
write.table(corm, file = "correlation.tsv", quote = F, sep = "\t")

Using mclapply took ~1 sec compared to ~9.

Creating a co-expression network

In the R code above, I saved the correlation matrix using the function write.matrix(). Below is a simple Perl script that parses the output from write.matrix() and creates a sif file, which can then be loaded into Cytoscape. Please note that this script does not capture negative correlations, only positive correlations that are equal to or higher than the $threshold value.

!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.txt> <correlation>\n";
my $infile = shift or die $usage;
my $cor = shift or die $usage;

my @name = ();
my %sif = ();
my $counter = 0;
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
chomp;
if ($. == 1){
@name = split(/\s+/);
next;
}
my $current = $. - 2;
my $counter = $counter + $. - 1;
my @cor = split();
print "$counter $current\n";
for (my $i=$counter; $i<scalar(@cor); ++$i){
if ($cor[$i] >= $cor){

print join("\t",$name[$current],$name[$i],$current,$i,$cor[$i]),"\n";

     print join("\t",$name[$current],'xx',$name[$i]),"\n";
  }

}
}
close(IN);

exit(0);

Basically the script examines the correlation matrix and prints out associations if the correlation between two rows is greater than $threshold. Here's how it looks when loaded into Cytoscape:

While most transcripts have a very similar expression pattern, several transcripts exhibit unique transcriptional responses.

Correlation network within R

An example of creating a correlation network was also available from the guide on Parallel Computing in R. Following on from above:

data(geneData, package = "Biobase")
pair <- combn(1:nrow(geneData), 2, simplify = F)
geneCor <- function(x, gene = geneData) {
cor(gene[x[1], ], gene[x[2], ])
}
out <- lapply(pair, geneCor)

head(out, n=3)

[[1]]

[1] 0.1536158

#

[[2]]

[1] 0.7066034

#

[[3]]

[1] -0.2125437

reshape the result

corm <- cbind(do.call(rbind, pair), unlist(out))
head(corm,n=3)
[,1] [,2] [,3]
[1,] 1 2 0.1536158
[2,] 1 3 0.7066034
[3,] 1 4 -0.2125437

remove low cor pairs

corm <- corm[abs(corm[,3]) >= 0.86, ]
dim(corm)

[1] 112 3

install packages if necessary

install.packages("network")
install.packages("sna")

library(network)
library(sna)

the network

net <- network(corm, directed = F)
net

Network attributes:

vertices = 498

directed = FALSE

hyper = FALSE

loops = FALSE

multiple = FALSE

bipartite = FALSE

total edges= 112

missing edges= 0

non-missing edges= 112

#

Vertex attribute names:

vertex.names

#

No edge attributes

component analysis

cd <- component.dist(net)

delete genes not connected with others

delete.vertices(net, which(cd$csize[cd$membership] == 1))

plot(net)

network

Using a real dataset

I will use the pnas_expression.txt file, which is a processed dataset from this this study.

data <-read.table("pnas_expression.txt",header=T,row.names=1)
dim(data)
[1] 37435 8
head(data)
lane1 lane2 lane3 lane4 lane5 lane6 lane8 len
ENSG00000215696 0 0 0 0 0 0 0 330
ENSG00000215700 0 0 0 0 0 0 0 2370
ENSG00000215699 0 0 0 0 0 0 0 1842
ENSG00000215784 0 0 0 0 0 0 0 2393
ENSG00000212914 0 0 0 0 0 0 0 384
ENSG00000212042 0 0 0 0 0 0 0 92

get rid of the len column

data <- data[,-data$len]

check how many rows are all zeros

table(rowSums(data)==0)

FALSE TRUE
21877 15558

create a smaller subset as an example

data_subset <- data[rowSums(data)>500,]
dim(data_subset)
[1] 4010 7

save row names

my_row <- rownames(data_subset)

convert into matrix

data_subset_matrix <- as.matrix(data_subset)

correlation of row 1 and row 2

cor(data_subset_matrix[1,],data_subset_matrix[2,],method="spearman")
[1] 0.5357143

correlation matrix

correlation_matrix <- cor(t(data_subset_matrix), method="spearman")
correlation_matrix[1,2]
[1] 0.5357143
dim(correlation_matrix)
[1] 4010 4010

output results

library(MASS)
write.matrix(correlation_matrix, file="pnas_expression_correlation.tsv")

The Perl script above can be used to convert the pnas_expression_correlation.tsv file into a sif file, which can then be loaded into Cytoscape.

Conclusions

Creating a correlation matrix with R is quite easy and as I have shown, the results can be visualised using Cytoscape. When applied to transcriptomic datasets, this may be useful in identifying co-expressed transcripts. I've shown an example of this using a real dataset, however note that in the example there are relatively few assays or samples. This may limit the usefulness of this approach since the number of transcriptional responses are smaller. However, imagine a dataset with a large number of assays, such as transcriptional profiling of a large panel of tissues. Using this approach, we may be able to unveil tissue specific transcripts since they will have very unique transcriptional responses.

Finally one can use the multicore package to speed up a large number of calculations.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
2 comments Add yours

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.