# Pearson vs. Spearman correlation

Correlation coefficients are commonly used to show how strongly two variables are related. The most widely used one is the Pearson correlation, which measures linear association and is sensitive to extreme values. The following example illustrates when not to use it:

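```r
# Ten paired observations; the last pair (2000, 2000) is an extreme outlier
x <- c(55, 70, 33, 100, 99, 15, 2, 1, 5, 2000)
y <- c(2, 10, 88, 20, 30, 88, 23, 49, 40, 2000)

# Pearson is dominated by the single outlying pair
cor(x, y, method = "pearson")
# [1] 0.9957008
# Spearman works on ranks, so the outlier counts only as "the largest value"
cor(x, y, method = "spearman")
# [1] -0.07294867
# A log transform shrinks the outlier's influence but does not remove it
cor(log(x), log(y), method = "pearson")
# [1] 0.3556905

# The mean is far above the median in both variables
summary(x)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    1.00    7.50   44.00  238.00   91.75 2000.00
summary(y)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    2.00   20.75   35.00  235.00   78.25 2000.00

# The standard deviations are inflated by the outlier as well
sd(x)
# [1] 620.2714
sd(y)
# [1] 620.8571
```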
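One way to see why Spearman's coefficient barely reacts to the outlier: it is the Pearson coefficient computed on the ranks of the data rather than on the raw values. The short check below is only a sketch of that idea; the variable names `rx`, `ry`, `rho_from_ranks`, and `rho_spearman` are introduced here for illustration.

```r
x <- c(55, 70, 33, 100, 99, 15, 2, 1, 5, 2000)
y <- c(2, 10, 88, 20, 30, 88, 23, 49, 40, 2000)

# rank() replaces each value with its position in sorted order;
# tied values (the two 88s in y) share the average of their ranks
rx <- rank(x)
ry <- rank(y)

# Pearson on the ranks gives the same number as Spearman on the raw values
rho_from_ranks <- cor(rx, ry, method = "pearson")
rho_spearman <- cor(x, y, method = "spearman")
all.equal(rho_from_ranks, rho_spearman)
# [1] TRUE

# After ranking, the outlier is simply the 10th of 10 values in each vector
c(rx[10], ry[10])
# [1] 10 10
```

On the rank scale the pair (2000, 2000) is no further from its neighbours than any other point, so it can no longer dominate the coefficient.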

If we remove the outlying pair (the value 2,000 in both `x` and `y`):

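```r
# Same data with the last pair (2000, 2000) dropped
x1 <- c(55, 70, 33, 100, 99, 15, 2, 1, 5)
y1 <- c(2, 10, 88, 20, 30, 88, 23, 49, 40)

# Both coefficients now agree on a moderate negative association
cor(x1, y1, method = "pearson")
# [1] -0.4440288
cor(x1, y1, method = "spearman")
# [1] -0.4769916
```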

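If your dataset has outliers, use a non-parametric measure such as Spearman's rank correlation: here a single extreme pair pushes the Pearson coefficient to 0.996, while Spearman's stays close to zero (-0.07). In this example it is probably best to remove the outlier as well, since only then does the negative association among the remaining nine points become visible, in both the Pearson (-0.44) and the Spearman (-0.48) coefficient.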
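If it is not obvious which observation is distorting the result, one simple option is to recompute the Pearson coefficient with each point left out in turn and see which omission changes it the most. The sketch below assumes the same `x` and `y` as above; `loo_pearson` is a hypothetical helper written for this illustration, not a base R function.

```r
x <- c(55, 70, 33, 100, 99, 15, 2, 1, 5, 2000)
y <- c(2, 10, 88, 20, 30, 88, 23, 49, 40, 2000)

# Hypothetical helper: Pearson correlation with observation i left out,
# computed for every i
loo_pearson <- function(x, y) {
  sapply(seq_along(x), function(i) cor(x[-i], y[-i], method = "pearson"))
}

full <- cor(x, y, method = "pearson")
loo <- loo_pearson(x, y)

# How far the coefficient moves when each single point is dropped
shift <- abs(loo - full)

# Index of the observation whose removal changes the coefficient the most
most_influential <- which.max(shift)
most_influential
# [1] 10
```

Dropping observation 10 reproduces the nine-point Pearson value of about -0.44, while dropping any other point leaves the coefficient close to 0.996. A leave-one-out check like this only flags single influential points; it is not a substitute for plotting the data.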