# Calculating Pearson correlation using Perl

My modification of code which is originally available here. Probably easier to understand the original code. I altered the code so that I could use an anonymous 2d array and with strictures, so that I could plug it into my own code.

Comments are included in the code below to assist use.

```#!/usr/bin/perl

use strict;
use warnings;

#make up some matrix for demonstration purposes
my \$x = [];
#generate 10 row
for (my \$i=1;\$i<11;++\$i){
#generate 2 columns, where column 1 = \$i and column 2 = \$i
for (my \$j=1;\$j<3;++\$j){
\$x->[\$i][\$j]= \$i;
}
}
#end matrix code

my \$correlation = correlation(\$x);
#Since the values in the columns are identical in \$x, the correlation will be 1
print "\$correlation\n";

#to use this code, remove the matrix code above
#generate an anonymous 2D array where \$x->[1] is the row
#\$x->[1][1] is the value in row 1 column 1 and \$x->[1][2] is the value of row 1 column 2
#once you build the entire array, pass it to the correlation subroutine as above
#my \$corrleation = correlation(\$x)

#if you want to see what's inside \$x use the code below
#for (my \$i = 1; \$i < scalar(@{\$x}); ++\$i){
#   my \$line_to_print = '';
#   for (my \$j = 1; \$j < scalar(@{\$x->[\$i]}); ++\$j){
#      \$line_to_print .= "\$x->[\$i]->[\$j]\t";
#   }
#   \$line_to_print =~ s/\t\$//;
#   print "\$line_to_print\n";
#}

sub mean {
my (\$x)=@_;
my \$num = scalar(@{\$x}) - 1;
my \$sum_x = '0';
my \$sum_y = '0';
for (my \$i = 1; \$i < scalar(@{\$x}); ++\$i){
\$sum_x += \$x->[\$i][1];
\$sum_y += \$x->[\$i][2];
}
my \$mu_x = \$sum_x / \$num;
my \$mu_y = \$sum_y / \$num;
return(\$mu_x,\$mu_y);
}

### ss = sum of squared deviations to the mean
sub ss {
my (\$x,\$mean_x,\$mean_y,\$one,\$two)=@_;
my \$sum = '0';
for (my \$i=1;\$i<scalar(@{\$x});++\$i){
\$sum += (\$x->[\$i][\$one]-\$mean_x)*(\$x->[\$i][\$two]-\$mean_y);
}
return \$sum;
}

sub correlation {
my (\$x) = @_;
my (\$mean_x,\$mean_y) = mean(\$x);
my \$ssxx=ss(\$x,\$mean_x,\$mean_y,1,1);
my \$ssyy=ss(\$x,\$mean_x,\$mean_y,2,2);
my \$ssxy=ss(\$x,\$mean_x,\$mean_y,1,2);
my \$correl=correl(\$ssxx,\$ssyy,\$ssxy);
my \$xcorrel=sprintf("%.4f",\$correl);
return(\$xcorrel);

}

sub correl {
my(\$ssxx,\$ssyy,\$ssxy)=@_;
my \$sign=\$ssxy/abs(\$ssxy);
my \$correl=\$sign*sqrt(\$ssxy*\$ssxy/(\$ssxx*\$ssyy));
return \$correl;
}
```

.
1. Mark says:

for (my \$i=1;\$i[\$i][\$one]-\$mean_x)*(\$x->[\$i][\$two]-\$mean_y);
}

…. ?

1. Davo says:

Thanks Mark. The code wasn’t showing properly before.

2. Mark says:

No problem.

I also made re-write of the original code because I kept getting correl=1.00 for everything when I tried to do this myself. The code is here:
http://pastebin.com/i5m4qa99

1. Davo says:

Thanks for sharing ðŸ™‚

2. Shivalika says:

Can you please tell me about the format of both input files. query and annotaion files

3. Mark says:

note: I shift off the first element of the arrays because they contained array id’s in my data files.

4. Roy says:

Can one also get a p-value from this?

1. Davo says:

Hi Roy,

The code will only return the Pearson correlation coefficient.

Cheers,

Dave

5. Rob says:

Shouldn’t this:

my \$ssxx=ss(\$x,\$mean_x,\$mean_y,1,1);
my \$ssyy=ss(\$x,\$mean_x,\$mean_y,2,2);

be this:

my \$ssxx=ss(\$x,\$mean_x,\$mean_x,1,1);
my \$ssyy=ss(\$x,\$mean_y,\$mean_y,2,2);

That’s how it looks to be in your original code.

Rob

1. Davo says:

Hi Rob,

When you said “your original code”, did you mean the link I reference right at the start of this post (http://homepages.ulb.ac.be/~dgonze/SCRIPTS/PERL/correlation.pl)? If so, the code was written by someone else; I merely just modified it so I get the correlation value by passing an array reference.

If you were referring to that particular original code, then here’s my answer. The original code did not use strictures (i.e. use strict) and thus the ss() function could access the values of \$mean[1] and \$mean[2] globally. What I did was that I confined the mean values within the correlation() function, thus the ss() function is unable to access the values. Therefore I have to pass the mean values to via ss(\$x,\$mean_x,\$mean_y,1,1) and the order of \$mean_x and \$mean_y would be same each time I call the ss() function.

Cheers,

Dave

This site uses Akismet to reduce spam. Learn how your comment data is processed.