Pearson vs. Spearman correlation

Correlation measures are commonly used to show how strongly two variables are related. A commonly used measure is the Pearson correlation, which captures linear association and is sensitive to outliers. The example below illustrates when not to use it:

x = c(55,70,33,100,99,15,2,1,5,2000)
y = c(2,10,88,20,30,88,23,49,40,2000)

cor(x,y,method="pearson")
[1] 0.9957008
cor(x,y,method="spearman")
[1] -0.07294867
cor(log(x),log(y),method="pearson")
[1] 0.3556905

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    7.50   44.00  238.00   91.75 2000.00
summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00   20.75   35.00  235.00   78.25 2000.00

sd(x)
[1] 620.2714
sd(y)
[1] 620.8571
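
Why is Spearman barely affected by the 2000 values? One way to see it is that Spearman's coefficient is the Pearson correlation computed on the ranks of the data, so the extreme pair only contributes rank 10 instead of a value of 2000. The sketch below checks this on the same x and y; the names rx, ry, spearman_manual and spearman_builtin are introduced here for illustration only.

```r
x <- c(55, 70, 33, 100, 99, 15, 2, 1, 5, 2000)
y <- c(2, 10, 88, 20, 30, 88, 23, 49, 40, 2000)

# rank() gives tied values their average rank by default,
# so the two 88s in y both receive rank 8.5
rx <- rank(x)
ry <- rank(y)

# Pearson on the ranks should reproduce the Spearman result
spearman_manual  <- cor(rx, ry, method = "pearson")
spearman_builtin <- cor(x, y, method = "spearman")

all.equal(spearman_manual, spearman_builtin)
```

On the rank scale the outlier is just the largest of ten observations, which is why it cannot dominate the result the way it does for Pearson on the raw values.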

If we remove the outlying pair (2000, 2000):

x1 = c(55,70,33,100,99,15,2,1,5)
y1 = c(2,10,88,20,30,88,23,49,40)
cor(x1,y1,method="pearson")
[1] -0.4440288
cor(x1,y1,method="spearman")
[1] -0.4769916
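
Instead of retyping the vectors, the outlying pair can be dropped with a logical index, which keeps x and y aligned. This is only a sketch: the cutoff of 2000 is chosen by hand for this example and is not a general outlier rule, and the name keep is introduced here for illustration.

```r
x <- c(55, 70, 33, 100, 99, 15, 2, 1, 5, 2000)
y <- c(2, 10, 88, 20, 30, 88, 23, 49, 40, 2000)

# TRUE for every pair we want to keep; the threshold is an assumption
# made for this example, not a general-purpose outlier test
keep <- x < 2000 & y < 2000

# Apply the same index to both vectors so the pairs stay matched
x1 <- x[keep]
y1 <- y[keep]

cor(x1, y1, method = "pearson")
cor(x1, y1, method = "spearman")
```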

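If your dataset has outliers, use a non-parametric correlation measure (e.g. Spearman's rank correlation). In this example it would probably also be best to remove the outlier: with it, Pearson reports a near-perfect positive correlation (0.996) that is driven by the single (2000, 2000) pair, whereas after removal both measures agree on a moderate negative correlation, and Spearman's coefficient moves from -0.07 to -0.48.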
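
To compare the measures side by side, the calls above can be wrapped in a small loop. The sketch below also includes Kendall's tau, another rank-based option that cor() supports; it is not part of the example above and is added here only for comparison. The names methods, with_outlier and without_outlier are hypothetical.

```r
x <- c(55, 70, 33, 100, 99, 15, 2, 1, 5, 2000)
y <- c(2, 10, 88, 20, 30, 88, 23, 49, 40, 2000)

methods <- c("pearson", "spearman", "kendall")

# Correlation on the full data, one value per method
with_outlier <- sapply(methods, function(m) cor(x, y, method = m))

# Same again after dropping the 10th pair, i.e. (2000, 2000)
without_outlier <- sapply(methods, function(m) cor(x[-10], y[-10], method = m))

# One row per dataset, one column per method
round(rbind(with_outlier, without_outlier), 3)
```

If the rank-based columns change far less than the Pearson column between the two rows, that is a hint that a few extreme values are driving the Pearson result.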

This work is licensed under a Creative Commons
Attribution 4.0 International License.
