Updated 2024 April 7th
Incentive
Let $$A$$ be a matrix, where $$a_{ij}$$ are the elements of $$A$$, $$i$$ is the row and $$j$$ is the column.
If the matrix $$A$$ contained transcript expression data, then $$a_{ij}$$ is the expression level of the $$i$$-th transcript in the $$j$$-th assay. The elements of the $$i$$-th row of $$A$$ form the transcriptional response of the $$i$$-th transcript. The elements of the $$j$$-th column of $$A$$ form the expression profile of the $$j$$-th assay.
Transcripts that have a similar transcriptional response may be co-expressed and could have related biological functions. A simple way of looking at co-expression is through correlation, i.e., correlating all pairs of transcriptional responses, which results in a correlation matrix.
Getting started
Let's get started with a simple example, with a $$10 \times 10$$ matrix.
# create random matrix with numbers ranging from 1 to 100
set.seed(12345)
A <- matrix(runif(100,1,100),nrow=10,ncol=10,byrow=T)
image(A)
Now let's calculate Spearman's rank correlation coefficient between every pair of rows using R and store the results in a correlation matrix.
# our "expression" matrix
A
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 72.369486 87.701546 76.337251 88.726332 46.191615 17.47081 33.184443 51.413209 73.042820 98.983957
[2,] 4.419008 16.084976 73.832810 1.112522 39.729130 46.78697 39.426254 40.846029 18.717395 95.214217
[3,] 45.919079 33.348488 96.576117 71.040706 64.809721 39.59302 70.155820 54.861729 23.420251 48.971218
[4,] 79.507710 1.592775 19.583532 68.501529 37.640308 36.80093 87.010695 90.511312 62.125032 14.269132
[5,] 78.437135 43.490683 92.800124 77.551079 26.708443 32.80124 6.959321 5.302189 6.450328 62.928737
[6,] 96.482559 82.902984 32.187795 22.089520 73.517116 50.42486 73.247425 8.953268 44.117518 24.421465
[7,] 79.365212 26.609747 98.612399 75.930501 97.998046 22.67584 94.922011 15.796337 60.435340 94.696644
[8,] 69.146982 51.047839 37.920688 34.244699 5.776884 62.27581 96.183282 65.841090 51.518907 15.859723
[9,] 87.174340 51.929726 1.856143 2.900282 15.306678 31.19814 82.740029 50.732120 80.553690 7.003358
[10,] 92.867559 81.009677 8.802521 60.491900 71.763208 51.87072 72.291534 75.244681 10.468412 40.384761
# row 1
A[1,]
[1] 72.36949 87.70155 76.33725 88.72633 46.19162 17.47081 33.18444 51.41321 73.04282 98.98396
# row 2
A[2,]
[1] 4.419008 16.084976 73.832810 1.112522 39.729130 46.786971 39.426254 40.846029 18.717395 95.214217
# calculating the Spearman's correlation between row 1 and row 2
cor(A[1,], A[2,], method="spearman")
[1] -0.1030303
# calculating all row by row correlations; cor() correlates columns,
# so remember the transpose t()
correlation_matrix <- cor(t(A), method="spearman")
correlation_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1.00000000 -0.10303030 0.05454545 -0.4545455 0.50303030 -0.3212121 0.1151515 -0.5393939
[2,] -0.10303030 1.00000000 0.15151515 -0.3454545 -0.05454545 -0.3575758 0.2000000 -0.2242424
[3,] 0.05454545 0.15151515 1.00000000 0.2363636 0.33333333 -0.3696970 0.6121212 -0.2121212
[4,] -0.45454545 -0.34545455 0.23636364 1.0000000 -0.43030303 -0.1757576 -0.1636364 0.6121212
[5,] 0.50303030 -0.05454545 0.33333333 -0.4303030 1.00000000 0.1515152 0.4545455 -0.2969697
[6,] -0.32121212 -0.35757576 -0.36969697 -0.1757576 0.15151515 1.0000000 0.2000000 0.2242424
[7,] 0.11515152 0.20000000 0.61212121 -0.1636364 0.45454545 0.2000000 1.0000000 -0.3575758
[8,] -0.53939394 -0.22424242 -0.21212121 0.6121212 -0.29696970 0.2242424 -0.3575758 1.0000000
[9,] -0.41818182 -0.45454545 -0.56363636 0.4424242 -0.40606061 0.5878788 -0.2969697 0.7696970
[10,] -0.22424242 -0.53939394 -0.17575758 0.3818182 -0.12727273 0.4909091 -0.3090909 0.4545455
[,9] [,10]
[1,] -0.4181818 -0.2242424
[2,] -0.4545455 -0.5393939
[3,] -0.5636364 -0.1757576
[4,] 0.4424242 0.3818182
[5,] -0.4060606 -0.1272727
[6,] 0.5878788 0.4909091
[7,] -0.2969697 -0.3090909
[8,] 0.7696970 0.4545455
[9,] 1.0000000 0.6242424
[10,] 0.6242424 1.0000000
# output the matrix in a file
# using rounded values
library(MASS)
write.matrix(round(correlation_matrix, digits=3), file="cor_matrix.txt")
If we look at the file cor_matrix.txt:
cat cor_matrix.txt
1.000 -0.103 0.055 -0.455 0.503 -0.321 0.115 -0.539 -0.418 -0.224
-0.103 1.000 0.152 -0.345 -0.055 -0.358 0.200 -0.224 -0.455 -0.539
0.055 0.152 1.000 0.236 0.333 -0.370 0.612 -0.212 -0.564 -0.176
-0.455 -0.345 0.236 1.000 -0.430 -0.176 -0.164 0.612 0.442 0.382
0.503 -0.055 0.333 -0.430 1.000 0.152 0.455 -0.297 -0.406 -0.127
-0.321 -0.358 -0.370 -0.176 0.152 1.000 0.200 0.224 0.588 0.491
0.115 0.200 0.612 -0.164 0.455 0.200 1.000 -0.358 -0.297 -0.309
-0.539 -0.224 -0.212 0.612 -0.297 0.224 -0.358 1.000 0.770 0.455
-0.418 -0.455 -0.564 0.442 -0.406 0.588 -0.297 0.770 1.000 0.624
-0.224 -0.539 -0.176 0.382 -0.127 0.491 -0.309 0.455 0.624 1.000
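Two details of the calculation above can be checked directly: Spearman's coefficient is Pearson's coefficient computed on ranks, and `cor()` works on columns, which is why the matrix has to be transposed. The following is a minimal sketch that rebuilds the same matrix so it runs on its own; the names `via_ranks` and `pairwise` are mine and not part of the original example.

```r
# rebuild the example matrix so that this block is self-contained
set.seed(12345)
A <- matrix(runif(100, 1, 100), nrow = 10, ncol = 10, byrow = TRUE)

# cor() correlates columns, so the rows are transposed into columns first
correlation_matrix <- cor(t(A), method = "spearman")

# Spearman's coefficient for rows 1 and 2, computed two ways
pairwise  <- cor(A[1, ], A[2, ], method = "spearman")
via_ranks <- cor(rank(A[1, ]), rank(A[2, ]))  # Pearson on the ranks

# the two calculations agree
all.equal(via_ranks, pairwise)

# the matrix entry for rows 1 and 2 is the same pairwise value
all.equal(correlation_matrix[1, 2], pairwise)

# the matrix is symmetric and every row correlates perfectly with itself
isSymmetric(correlation_matrix)
all.equal(diag(correlation_matrix), rep(1, nrow(A)))
```

Because the matrix is symmetric, only the values above (or below) the diagonal need to be inspected when looking for co-expressed pairs.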
What about bigger matrices?
The computational time increases quadratically as we have more rows, since we are calculating the correlations between every row pair and $$n$$ rows give $$n(n-1)/2$$ pairs. For a matrix with 10,005 rows and 40 columns, there would be:
choose(10005,2)
[1] 50045010
50,045,010 comparisons. It took ~33 hours to do all the calculations on one core of an Intel(R) Xeon(R) CPU X7560 @ 2.27GHz.
Let's create a plot of the number of rows versus the number of pairwise comparisons:
# for a dataset with 10 rows
choose(10,2)
[1] 45
# for a dataset with 195,433 rows
choose(195433,2)
[1] 19096931028
# store comparisons in an array
row_number <- 1000000
x <- array(0, dim=c(row_number,1))
system.time(
for (i in 1:row_number){
x[i,1]<- choose(i,2)
}
)
user system elapsed
3.39 0.00 3.39
tail(x)
[,1]
[999995,] 499994500015
[999996,] 499995500010
[999997,] 499996500006
[999998,] 499997500003
[999999,] 499998500001
[1e+06,] 499999500000
plot(x,type="l")
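Since `choose(n, 2)` equals `n * (n - 1) / 2`, the same numbers can be produced without an explicit loop. The sketch below is one way to do it, not the code used for the timings above; the names `n`, `comparisons` and `closed_form` are mine.

```r
# number of rows to consider; stored as doubles because n * (n - 1)
# would overflow R's integer type for large n
n <- as.numeric(1:1000000)

# choose() is vectorised, so it can take the whole vector at once
comparisons <- choose(n, 2)

# the closed form n(n-1)/2 gives the same values
closed_form <- n * (n - 1) / 2
all.equal(comparisons, closed_form)

# doubling the number of rows roughly quadruples the number of comparisons
choose(20000, 2) / choose(10000, 2)

plot(n, comparisons, type = "l",
     xlab = "number of rows", ylab = "pairwise comparisons")
```

The ratio in the second-to-last step is what quadratic growth looks like in practice: twice the rows means about four times the work.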
As I mentioned above, it took roughly 33 hours to compute the correlations of a 10,005 by 40 matrix (50,045,010 comparisons) on one core.
We can speed this up by using more cores.
Parallel calculation of correlations
This fantastic guide on Parallel Computing in R shows how we can parallelise the calculations and I will follow their example. They used the test dataset included in the package Biobase:
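# install if necessary; the biocLite.R script has been retired,
# so current Bioconductor releases install packages via BiocManager
install.packages("BiocManager")
BiocManager::install("Biobase")
# load the data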
data(geneData, package = "Biobase")
dim(geneData)
[1] 500 26
# how many comparisons?
choose(500,2)
[1] 124750
The comparisons go like this:
row1 by row2
row1 by row3
...
We can use the combn() function to generate the comparisons as a list:
# for 3 rows
combn(3, 2, simplify=F)
[[1]]
[1] 1 2
[[2]]
[1] 1 3
[[3]]
[1] 2 3
# for all rows of geneData
pair <- combn(1:nrow(geneData), 2, simplify = F)
# now we need to create a function to calculate the correlations
# taking the input of the object pair
geneCor <- function(x, gene = geneData) {
cor(gene[x[1], ], gene[x[2], ])
}
# to find the correlation of row 1 and 5
geneCor(c(1,5))
[1] 0.6572192
# all correlations, stored in out
system.time(out <- lapply(pair, geneCor))
user system elapsed
9.153 0.012 9.181
# results
head(out, 3)
[[1]]
[1] 0.1536158
[[2]]
[1] 0.7066034
[[3]]
[1] -0.2125437
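# we can parallelise this by using mclapply(),
# originally from the package multicore; that package has been
# removed from CRAN and mclapply() now ships with the parallel
# package, which is included with R
# load library
library(parallel)
# use the mclapply() function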
system.time(out <- mclapply(pair, geneCor))
user system elapsed
11.083 0.561 1.206
corm <- cbind(do.call(rbind, pair), unlist(out))
write.table(corm, file = "correlation.tsv", quote = F, sep = "\t")
Using mclapply() took ~1 second of elapsed time compared to ~9 seconds with lapply().
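The number of worker processes can be set explicitly with the `mc.cores` argument of `mclapply()`. Here is a minimal sketch using the `parallel` package that comes with R; it runs on a simulated matrix rather than geneData so that it does not need Biobase, and the names `m`, `rowCor`, `n_cores`, `serial` and `forked` are mine. It also checks that the pairwise results match the upper triangle of `cor(t(m))`.

```r
library(parallel)

# simulated stand-in for geneData (assumption: 200 rows, 26 columns)
set.seed(1)
m <- matrix(rnorm(200 * 26), nrow = 200, ncol = 26)

# all row pairs, as in the example above
pair <- combn(nrow(m), 2, simplify = FALSE)

# correlation of the two rows named in x
rowCor <- function(x, mat = m) {
  cor(mat[x[1], ], mat[x[2], ])
}

# forking is not available on Windows, so fall back to one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial <- lapply(pair, rowCor)
forked <- mclapply(pair, rowCor, mc.cores = n_cores)

# both approaches return the same correlations in the same order
all.equal(unlist(serial), unlist(forked))

# the same values sit in the corresponding cells of the full matrix
full <- cor(t(m))
idx <- do.call(rbind, pair)
all.equal(unlist(serial), full[idx])
```

How much faster the forked version runs depends on the number of cores available and on the overhead of starting the workers, so it is worth timing both with `system.time()` on your own data.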
Creating a co-expression network
In the R code above, I saved the correlation matrix using the function write.matrix(). Below is a simple Perl script that parses the output from write.matrix() and creates a sif file, which can then be loaded into Cytoscape. The script treats the first line of the file as the row/column names. Please note that this script does not capture negative correlations, only positive correlations that are equal to or higher than the threshold given as the second argument.
#!/usr/bin/perl
use strict;
use warnings;
my $usage = "Usage: $0 <infile.txt> <correlation>\n";
my $infile = shift or die $usage;
# test with defined() so that a threshold of 0 is accepted
my $threshold = shift;
die $usage unless defined $threshold;
my @name = ();
open(my $in, '<', $infile) || die "Could not open $infile: $!\n";
while (<$in>){
   chomp;
   # the first line holds the row/column names
   if ($. == 1){
      @name = split(' ', $_);
      next;
   }
   # zero-based index of the current row of the matrix
   my $current = $. - 2;
   my @cor = split(' ', $_);
   # start one past the diagonal so that each pair is reported once
   for (my $i = $current + 1; $i < scalar(@cor); ++$i){
      if ($cor[$i] >= $threshold){
         # sif format: node, interaction type, node
         print join("\t", $name[$current], 'xx', $name[$i]), "\n";
      }
   }
}
close($in);
exit(0);
Basically the script examines the correlation matrix and prints out an association whenever the correlation between two rows is equal to or greater than the threshold. Here's how it looks when loaded into Cytoscape:
While most transcripts have a very similar expression pattern, several transcripts exhibit unique transcriptional responses.
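The same conversion can also be done without leaving R. The sketch below is an alternative to the Perl script, not a translation of it; it uses the small random matrix from the start of this post, and the row names, the 0.6 threshold and the output file name `cor_matrix.sif` are assumptions made for illustration. The interaction label `xx` mirrors the one used by the Perl script.

```r
# small named correlation matrix standing in for a real one
set.seed(12345)
A <- matrix(runif(100, 1, 100), nrow = 10, ncol = 10, byrow = TRUE)
rownames(A) <- paste0("transcript", 1:10)  # hypothetical names
correlation_matrix <- cor(t(A), method = "spearman")

# hypothetical cut-off; only positive correlations at or above it are kept
threshold <- 0.6

# look only above the diagonal so that each pair is reported once
keep <- upper.tri(correlation_matrix) & correlation_matrix >= threshold
idx <- which(keep, arr.ind = TRUE)

# sif layout: node, interaction type, node
sif <- data.frame(
  source = rownames(correlation_matrix)[idx[, "row"]],
  interaction = rep("xx", nrow(idx)),
  target = colnames(correlation_matrix)[idx[, "col"]]
)

write.table(sif, file = "cor_matrix.sif", quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = FALSE)
```

To keep strongly negative correlations as well, the comparison could be made on `abs(correlation_matrix)` instead.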
Correlation network within R
An example of creating a correlation network was also available from the guide on Parallel Computing in R. Following on from above:
data(geneData, package = "Biobase")
pair <- combn(1:nrow(geneData), 2, simplify = F)
geneCor <- function(x, gene = geneData) {
cor(gene[x[1], ], gene[x[2], ])
}
out <- lapply(pair, geneCor)
head(out, n=3)
[[1]]
[1] 0.1536158
#
[[2]]
[1] 0.7066034
#
[[3]]
[1] -0.2125437
# reshape the result
corm <- cbind(do.call(rbind, pair), unlist(out))
head(corm,n=3)
[,1] [,2] [,3]
[1,] 1 2 0.1536158
[2,] 1 3 0.7066034
[3,] 1 4 -0.2125437
# remove low cor pairs
corm <- corm[abs(corm[,3]) >= 0.86, ]
dim(corm)
[1] 112 3
# install packages if necessary
install.packages("network")
install.packages("sna")
library(network)
library(sna)
# the network
net <- network(corm, directed = F)
net
Network attributes:
vertices = 498
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 112
missing edges= 0
non-missing edges= 112
#
Vertex attribute names:
vertex.names
#
No edge attributes
# component analysis
cd <- component.dist(net)
# delete genes not connected with others
delete.vertices(net, which(cd$csize[cd$membership] == 1))
plot(net)
Using a real dataset
I will use the pnas_expression.txt file, which is a processed dataset from this study.
data <-read.table("pnas_expression.txt",header=T,row.names=1)
dim(data)
[1] 37435 8
head(data)
lane1 lane2 lane3 lane4 lane5 lane6 lane8 len
ENSG00000215696 0 0 0 0 0 0 0 330
ENSG00000215700 0 0 0 0 0 0 0 2370
ENSG00000215699 0 0 0 0 0 0 0 1842
ENSG00000215784 0 0 0 0 0 0 0 2393
ENSG00000212914 0 0 0 0 0 0 0 384
ENSG00000212042 0 0 0 0 0 0 0 92
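# get rid of the len column (select it by name; indexing with
# -data$len would use the length values themselves as column positions)
data <- data[, colnames(data) != "len"]
# check how many rows are all zeros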
table(rowSums(data)==0)
FALSE TRUE
21877 15558
# create a smaller subset as an example
data_subset <- data[rowSums(data)>500,]
dim(data_subset)
[1] 4010 7
# save row names
my_row <- rownames(data_subset)
# convert into matrix
data_subset_matrix <- as.matrix(data_subset)
# correlation of row 1 and row 2
cor(data_subset_matrix[1,],data_subset_matrix[2,],method="spearman")
[1] 0.5357143
# correlation matrix
correlation_matrix <- cor(t(data_subset_matrix), method="spearman")
correlation_matrix[1,2]
[1] 0.5357143
dim(correlation_matrix)
[1] 4010 4010
# output results
library(MASS)
write.matrix(correlation_matrix, file="pnas_expression_correlation.tsv")
The Perl script above can be used to convert the pnas_expression_correlation.tsv file into a sif file, which can then be loaded into Cytoscape.
Conclusions
Creating a correlation matrix with R is quite easy and, as I have shown, the results can be visualised using Cytoscape. When applied to transcriptomic datasets, this may be useful in identifying co-expressed transcripts. I've shown an example of this using a real dataset; however, note that in the example there are relatively few assays or samples. This may limit the usefulness of this approach, since each transcriptional response is made up of only a handful of values. However, imagine a dataset with a large number of assays, such as transcriptional profiling of a large panel of tissues. Using this approach, we may be able to unveil tissue-specific transcripts, since they will have distinctive transcriptional responses.
Finally, one can use mclapply() (originally from the multicore package, now part of the parallel package) to speed up a large number of calculations.
This work is licensed under a Creative Commons
Attribution 4.0 International License.