Correlation measures are commonly used to quantify how strongly two variables are related. A common choice is the Pearson correlation. The following example illustrates when not to use it:
x = c(55,70,33,100,99,15,2,1,5,2000)
y = c(2,10,88,20,30,88,23,49,40,2000)
cor(x,y,method="pearson")
[1] 0.9957008
cor(x,y,method="spearman")
[1] -0.07294867
cor(log(x),log(y),method="pearson")
[1] 0.3556905
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    7.50   44.00  238.00   91.75 2000.00
summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00   20.75   35.00  235.00   78.25 2000.00
sd(x)
[1] 620.2714
sd(y)
[1] 620.8571
If we remove the 2,000 value:
x1 = c(55,70,33,100,99,15,2,1,5)
y1 = c(2,10,88,20,30,88,23,49,40)
cor(x1,y1,method="pearson")
[1] -0.4440288
cor(x1,y1,method="spearman")
[1] -0.4769916
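The effect of a single point can also be checked systematically rather than by removing one value by hand. The following is a minimal sketch of a leave-one-out check on the same data; the variable names `full`, `loo`, `shift` and `most_influential` are made up for this illustration.

```r
x <- c(55, 70, 33, 100, 99, 15, 2, 1, 5, 2000)
y <- c(2, 10, 88, 20, 30, 88, 23, 49, 40, 2000)

# Pearson correlation on the full data
full <- cor(x, y, method = "pearson")

# Pearson correlation recomputed with observation i left out
loo <- sapply(seq_along(x), function(i) cor(x[-i], y[-i], method = "pearson"))

# Change in the coefficient caused by dropping each observation
shift <- loo - full

# Index of the observation whose removal changes the coefficient the most
most_influential <- which.max(abs(shift))

print(data.frame(x = x, y = y, pearson_without = round(loo, 3)))
print(most_influential)
```

A point whose removal flips the sign of the coefficient, as the tenth observation does here, is a strong hint that the Pearson value describes that one point rather than the rest of the data.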
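With the 2,000 value included, the Pearson correlation is 0.996, which suggests an almost perfect positive relationship, while Spearman's rank correlation is -0.07. That single point drives the Pearson result: once it is removed, both measures agree on a moderate negative correlation (-0.44 and -0.48). Use a non-parametric correlation measure (e.g. Spearman's rank) if your dataset has outliers. Here it would probably be best to remove the outlier as well, since the negative correlation only shows clearly in Spearman's rank once the outlier is gone.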
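Spearman's rank correlation is less sensitive to extreme values because it is the Pearson correlation computed on the ranks of the data, so a value of 2,000 counts only as "the largest", not as 2,000. A short sketch on the data without the outlier; `rx`, `ry`, `spearman_builtin` and `spearman_by_hand` are names chosen for this illustration.

```r
x1 <- c(55, 70, 33, 100, 99, 15, 2, 1, 5)
y1 <- c(2, 10, 88, 20, 30, 88, 23, 49, 40)

# rank() gives tied values their average rank (the two 88s in y1 share rank 8.5)
rx <- rank(x1)
ry <- rank(y1)

# Built-in Spearman versus Pearson applied to the ranks
spearman_builtin <- cor(x1, y1, method = "spearman")
spearman_by_hand <- cor(rx, ry, method = "pearson")

print(c(builtin = spearman_builtin, by_hand = spearman_by_hand))
```

Both values should agree, which is why adding or removing one extreme observation shifts the Spearman coefficient far less than the Pearson coefficient.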
This work is licensed under a Creative Commons
Attribution 4.0 International License.