# Creating a correlation matrix with R

Updated 2024 April 7th

## Incentive

Let $A$ be a $m \times n$ matrix, where $a_{ij}$ are elements of $A$, where $i$ is the $i_{th}$ row and $j$ is the $j_{th}$ column.

$A = \begin{bmatrix} a_{11} & \cdots & a_{1j} & \cdots & a_{1n} \\\\ \vdots & \ddots & \vdots && \vdots \\\\ a_{i1} & \cdots & a_{ij} & \cdots & a_{in} \\\\ \vdots && \vdots & \ddots & \vdots \\\\ a_{m1} & \cdots & a_{mj} & \cdots & a_{mn} \end{bmatrix}$

If the matrix $A$ contained transcript expression data, then $a_{ij}$ is the expression level of the $i^{th}$ transcript in the $j^{th}$ assay. The elements of the $i^{th}$ row of $A$ form the transcriptional response of the $i^{th}$ transcript. The elements of the $j^{th}$ column of $A$ form the expression profile of the $j^{th}$ assay.

Transcripts that have a similar transcriptional response may indicate that they are co-expressed together and could have related biological functions. A simple way of looking at co-expression is through correlation, i.e., correlating all pairs of transcriptional responses, which results in a correlation matrix.

### Getting started

Let's get started with a simple example, with a $$10 \times 10$$ matrix.

# create random matrix with numbers ranging from 1 to 100

set.seed(12345)
A <- matrix(runif(100,1,100),nrow=10,ncol=10,byrow=T)
image(A)

Heatmap of $$A$$

Now to calculate the correlations, using Spearman's rank correlation coefficient, of each row by each row using R and storing the results in a correlation matrix.

# our "expression" matrix

A
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 72.369486 87.701546 76.337251 88.726332 46.191615 17.47081 33.184443 51.413209 73.042820 98.983957
[2,] 4.419008 16.084976 73.832810 1.112522 39.729130 46.78697 39.426254 40.846029 18.717395 95.214217
[3,] 45.919079 33.348488 96.576117 71.040706 64.809721 39.59302 70.155820 54.861729 23.420251 48.971218
[4,] 79.507710 1.592775 19.583532 68.501529 37.640308 36.80093 87.010695 90.511312 62.125032 14.269132
[5,] 78.437135 43.490683 92.800124 77.551079 26.708443 32.80124 6.959321 5.302189 6.450328 62.928737
[6,] 96.482559 82.902984 32.187795 22.089520 73.517116 50.42486 73.247425 8.953268 44.117518 24.421465
[7,] 79.365212 26.609747 98.612399 75.930501 97.998046 22.67584 94.922011 15.796337 60.435340 94.696644
[8,] 69.146982 51.047839 37.920688 34.244699 5.776884 62.27581 96.183282 65.841090 51.518907 15.859723
[9,] 87.174340 51.929726 1.856143 2.900282 15.306678 31.19814 82.740029 50.732120 80.553690 7.003358
[10,] 92.867559 81.009677 8.802521 60.491900 71.763208 51.87072 72.291534 75.244681 10.468412 40.384761

# row 1

A[1,]
[1] 72.36949 87.70155 76.33725 88.72633 46.19162 17.47081 33.18444 51.41321 73.04282 98.98396

# row 2

A[2,]
[1] 4.419008 16.084976 73.832810 1.112522 39.729130 46.786971 39.426254 40.846029 18.717395 95.214217

# calculating the Spearman's correlation between row 1 and row 2

cor(A[1,], A[2,], method="spearman")
[1] -0.1030303

# calculating all row by row correlations, remember the transpose t()

correlation_matrix <- cor(t(A), method="spearman")
correlation_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1.00000000 -0.10303030 0.05454545 -0.4545455 0.50303030 -0.3212121 0.1151515 -0.5393939
[2,] -0.10303030 1.00000000 0.15151515 -0.3454545 -0.05454545 -0.3575758 0.2000000 -0.2242424
[3,] 0.05454545 0.15151515 1.00000000 0.2363636 0.33333333 -0.3696970 0.6121212 -0.2121212
[4,] -0.45454545 -0.34545455 0.23636364 1.0000000 -0.43030303 -0.1757576 -0.1636364 0.6121212
[5,] 0.50303030 -0.05454545 0.33333333 -0.4303030 1.00000000 0.1515152 0.4545455 -0.2969697
[6,] -0.32121212 -0.35757576 -0.36969697 -0.1757576 0.15151515 1.0000000 0.2000000 0.2242424
[7,] 0.11515152 0.20000000 0.61212121 -0.1636364 0.45454545 0.2000000 1.0000000 -0.3575758
[8,] -0.53939394 -0.22424242 -0.21212121 0.6121212 -0.29696970 0.2242424 -0.3575758 1.0000000
[9,] -0.41818182 -0.45454545 -0.56363636 0.4424242 -0.40606061 0.5878788 -0.2969697 0.7696970
[10,] -0.22424242 -0.53939394 -0.17575758 0.3818182 -0.12727273 0.4909091 -0.3090909 0.4545455
[,9] [,10]
[1,] -0.4181818 -0.2242424
[2,] -0.4545455 -0.5393939
[3,] -0.5636364 -0.1757576
[4,] 0.4424242 0.3818182
[5,] -0.4060606 -0.1272727
[6,] 0.5878788 0.4909091
[7,] -0.2969697 -0.3090909
[8,] 0.7696970 0.4545455
[9,] 1.0000000 0.6242424
[10,] 0.6242424 1.0000000

# using rounded values

library(MASS)
write.matrix(round(correlation_matrix, digits=3), file="cor_matrix.txt")

If we look at the file cor_matrix.txt:

cat cor_matrix.txt
1.000 -0.103 0.055 -0.455 0.503 -0.321 0.115 -0.539 -0.418 -0.224
-0.103 1.000 0.152 -0.345 -0.055 -0.358 0.200 -0.224 -0.455 -0.539
0.055 0.152 1.000 0.236 0.333 -0.370 0.612 -0.212 -0.564 -0.176
-0.455 -0.345 0.236 1.000 -0.430 -0.176 -0.164 0.612 0.442 0.382
0.503 -0.055 0.333 -0.430 1.000 0.152 0.455 -0.297 -0.406 -0.127
-0.321 -0.358 -0.370 -0.176 0.152 1.000 0.200 0.224 0.588 0.491
0.115 0.200 0.612 -0.164 0.455 0.200 1.000 -0.358 -0.297 -0.309
-0.539 -0.224 -0.212 0.612 -0.297 0.224 -0.358 1.000 0.770 0.455
-0.418 -0.455 -0.564 0.442 -0.406 0.588 -0.297 0.770 1.000 0.624
-0.224 -0.539 -0.176 0.382 -0.127 0.491 -0.309 0.455 0.624 1.000

The computational time increases exponentially as we have more rows, since we are calculating the correlations between every row pair. For a matrix with 10005 rows and 40 columns, there would be:

choose(10005,2)
[1] 50045010

50,045,010 comparisons. It took ~33 hours to do all the calculations on one core of a Intel(R) Xeon(R) CPU X7560 @ 2.27GHz.

Let's create a plot of the number of rows versus the number of pairwise comparisons:

choose(10,2)
[1] 45

choose(195433,2)
[1] 19096931028

# store comparisons in an array

row_number <- 1000000
x <- array(0, dim=c(row_number,1))
system.time(
for (i in 1:row_number){
x[i,1]<- choose(i,2)
}
)

# 3.39 0.00 3.39

tail(x)
[,1]
[999995,] 499994500015
[999996,] 499995500010
[999997,] 499996500006
[999998,] 499997500003
[999999,] 499998500001
[1e+06,] 499999500000

plot(x,type="l")

As I mentioned above, it took roughly 33 hours to compute the correlations of a 10,005 by 40 matrix (50,045,010 comparisons) on one core.

We can speed this up by using more cores.

### Parallel calculation of correlations

This fantastic guide on Parallel Computing in R shows how we can parallelise the calculations and I will follow their example. They used the test dataset included in the package Biobase:

# install if necessary

source("http://bioconductor.org/biocLite.R"ðŸ˜‰
biocLite("Biobase")

data(geneData, package = "Biobase")
dim(geneData)

# how many comparisons?

choose(500,2)
[1] 124750

The comparisons go like this:

row1 by row2
row1 by row3
...

We can use the combn() function to generate the comparisons as a list:

# for 3 rows

combn(3, 2, simplify=F)

# for all rows of geneData

pair <- combn(1:nrow(geneData), 2, simplify = F)

# taking the input of the object pair

geneCor <- function(x, gene = geneData) {
cor(gene[x[1], ], gene[x[2], ])
}

geneCor(c(1,5))

# all correlations, stored in out

system.time(out <- lapply(pair, geneCor))

[[1]]
[1] 0.1536158

[[2]]
[1] 0.7066034

[[3]]
[1] -0.2125437

# install if necessary

install.packages("multicore")

library(multicore)

# use the mclapply() function

system.time(out <- mclapply(pair, geneCor))

# 11.083 0.561 1.206

corm <- cbind(do.call(rbind, pair), unlist(out))
write.table(corm, file = "correlation.tsv", quote = F, sep = "\t")

Using mclapply took ~1 sec compared to ~9.

In the R code above, I saved the correlation matrix using the function write.matrix(). Below is a simple Perl script that parses the output from write.matrix() and creates a sif file, which can then be loaded into Cytoscape. Please note that this script does not capture negative correlations, only positive correlations that are equal to or higher than the $threshold value. # !/usr/bin/perl use strict; use warnings; my$usage = "Usage: $0 <infile.txt> <correlation>\n"; my$infile = shift or die $usage; my$cor = shift or die $usage; my @name = (); my %sif = (); my$counter = 0;
open(IN,'<',$infile) || die "Could not open$infile: $!\n"; while(<IN>){ chomp; if ($. == 1){
@name = split(/\s+/);
next;
}
my $current =$. - 2;
my $counter =$counter + $. - 1; my @cor = split(); print "$counter $current\n"; for (my$i=$counter;$i<scalar(@cor); ++$i){ if ($cor[$i] >=$cor){

# print join("\t",$name[$current],$name[$i],$current,$i,$cor[$i]),"\n";

     print join("\t",$name[$current],'xx',$name[$i]),"\n";
}

}
}
close(IN);

exit(0);

# check how many rows are all zeros

table(rowSums(data)==0)

FALSE TRUE
21877 15558

# create a smaller subset as an example

data_subset <- data[rowSums(data)>500,]
dim(data_subset)
[1] 4010 7

# save row names

my_row <- rownames(data_subset)

# convert into matrix

data_subset_matrix <- as.matrix(data_subset)

# correlation of row 1 and row 2

cor(data_subset_matrix[1,],data_subset_matrix[2,],method="spearman")
[1] 0.5357143

# correlation matrix

correlation_matrix <- cor(t(data_subset_matrix), method="spearman")
correlation_matrix[1,2]
[1] 0.5357143
dim(correlation_matrix)
[1] 4010 4010

# output results

library(MASS)
write.matrix(correlation_matrix, file="pnas_expression_correlation.tsv")

The Perl script above can be used to convert the pnas_expression_correlation.tsv file into a sif file, which can then be loaded into Cytoscape.

### Conclusions

Creating a correlation matrix with R is quite easy and as I have shown, the results can be visualised using Cytoscape. When applied to transcriptomic datasets, this may be useful in identifying co-expressed transcripts. I've shown an example of this using a real dataset, however note that in the example there are relatively few assays or samples. This may limit the usefulness of this approach since the number of transcriptional responses are smaller. However, imagine a dataset with a large number of assays, such as transcriptional profiling of a large panel of tissues. Using this approach, we may be able to unveil tissue specific transcripts since they will have very unique transcriptional responses.

Finally one can use the multicore package to speed up a large number of calculations.