Updated 2017 September 7th
The Kolmogorov-Smirnov test can be used to test whether two underlying one-dimensional probability distributions differ. As noted in the Wikipedia article:
Note that the two-sample test checks whether the two data samples come from the same distribution. This does not specify what that common distribution is (e.g. whether it's normal or not normal).
Let's generate two samples that follow the Poisson distribution with the same parameters.
# example 1
set.seed(123)
x <- rpois(n=1000, lambda=100)
set.seed(234)
y <- rpois(n=1000, lambda=100)

# perform a two-sample Kolmogorov-Smirnov test
ks.test(x,y)

    Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.036, p-value = 0.5361
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) :
  p-value will be approximate in the presence of ties
The null hypothesis is that both samples come from the same distribution. It is not rejected (p-value = 0.5361), which is the expected result because both samples were drawn from the same distribution.
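To make the D statistic less of a black box, here is a minimal sketch of computing it by hand. It assumes the usual definition of D as the largest absolute difference between the two empirical cumulative distribution functions; the names pooled, d_manual and d_ks are my own.

```r
# same samples as in example 1
set.seed(123)
x <- rpois(n=1000, lambda=100)
set.seed(234)
y <- rpois(n=1000, lambda=100)

# evaluate both empirical CDFs at every observed value
pooled <- sort(unique(c(x, y)))
d_manual <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))

# statistic reported by ks.test (the ties warning is suppressed here)
d_ks <- unname(suppressWarnings(ks.test(x, y))$statistic)

# TRUE if the hand-computed value agrees with ks.test
all.equal(d_manual, d_ks)
```

The hand-computed value should agree with the D that ks.test() reports, since both measure the largest vertical gap between the two ECDF curves.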
The warning message is due to the implementation of the KS test in R, which assumes a continuous distribution when calculating the p-value. With continuous data there should not be any identical values (ties), either within or between the two samples. I've read several sources and they all mention that the KS test can deal with both discrete and continuous data (I'm guessing because it only compares cumulative distributions), but I'm not sure how the implementation in R handles ties beyond reporting an approximate p-value, as the warning says. For more information, see this page.
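To see how many ties there actually are, the samples can be pooled and checked for repeated values. This is just a quick sketch; pooled and n_ties are names I made up for illustration.

```r
# same samples as in example 1
set.seed(123)
x <- rpois(n=1000, lambda=100)
set.seed(234)
y <- rpois(n=1000, lambda=100)

pooled <- c(x, y)

# number of observations that repeat an earlier value
n_ties <- sum(duplicated(pooled))
n_ties

# number of distinct values among the 2000 observations
length(unique(pooled))
```

Because Poisson counts are integers, most of the 2000 pooled observations repeat a value that has already been seen, which is why the warning appears.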
In the example below, some noise is added using the jitter() function to break the ties.

# example 2
set.seed(123)
x <- rpois(n=1000, lambda=100)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   66.0    93.0    99.5    99.5   106.0   130.0

# add some noise
set.seed(123)
x <- jitter(x, factor=100)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  52.30   88.82   99.71   99.40  109.43  148.52

set.seed(234)
y <- rpois(n=1000, lambda=100)
set.seed(234)
y <- jitter(y, factor=100)

ks.test(x,y)

    Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.027, p-value = 0.8593
alternative hypothesis: two-sided

plot(ecdf(x = x), main = "ECDF of x and y")
lines(ecdf(x = y), col = 2)
Despite the noise, the two distributions are quite similar.
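It's worth checking how much noise jitter() actually adds. As far as I understand the documentation, when amount is not given the noise is uniform on [-a, a] with a = factor/5 * d, where d is the smallest gap between distinct values. For integer counts d should be 1, so factor=100 would give a = 20. The sketch below checks that assumption; x_jit and max_shift are names I made up.

```r
set.seed(123)
x <- rpois(n=1000, lambda=100)
set.seed(123)
x_jit <- jitter(x, factor=100)

# largest shift applied to any single value
max_shift <- max(abs(x_jit - x))
max_shift

# spread before and after adding the noise
sd(x)
sd(x_jit)
```

If that reading is right, no value moves by more than 20. That is a lot of noise for data whose standard deviation is about 10, so the jittered samples are no longer Poisson distributed. The test still compares like with like, because both samples are jittered in the same way.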
Let's perform the test using two samples following different distributions.
set.seed(123)
x <- rpois(n=1000, lambda=100)
set.seed(123)
x <- jitter(x, factor=100)

set.seed(234)
y <- rnorm(n=1000, mean=100)
set.seed(234)
y <- jitter(y, factor=100)

ks.test(x,y)

    Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.439, p-value < 2.2e-16
alternative hypothesis: two-sided

plot(ecdf(x = x), main = "ECDF of x and y")
lines(ecdf(x = y), col = 2)
The two samples clearly have very different distributions.
In the example above, we get an extremely low p-value and we can reject the null hypothesis that both samples come from the same distribution, which they clearly don't.
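One thing to keep in mind about this example is that rnorm() defaults to sd=1, whereas a Poisson distribution with lambda=100 has a standard deviation of 10, so much of the difference comes from the spread rather than the shape. As a rough sketch (jitter is left out, and the names y_default, y_matched, d_default and d_matched are mine), the D statistic can be compared with and without a matched standard deviation.

```r
set.seed(123)
x <- rpois(n=1000, lambda=100)

set.seed(234)
y_default <- rnorm(n=1000, mean=100)          # sd defaults to 1
set.seed(234)
y_matched <- rnorm(n=1000, mean=100, sd=10)   # sd = sqrt(lambda)

# the ties warning is suppressed because x is discrete
d_default <- unname(suppressWarnings(ks.test(x, y_default))$statistic)
d_matched <- unname(suppressWarnings(ks.test(x, y_matched))$statistic)

c(default = d_default, matched = d_matched)
```

With the standard deviation matched, D should be much smaller, since a normal distribution with the same mean and variance approximates a Poisson distribution with a large lambda quite well.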
Testing for normality
Many statistical tests assume that the data is normally distributed. One approach for determining normality is using a QQ-normal plot.
set.seed(123)
norm_data <- rnorm(n=1000, mean=200, sd=50)
qqnorm(norm_data, pch = 16)
qqline(norm_data, col = 2)

The points fall close to the reference line, which is what we expect for normally distributed data.
set.seed(123)
gamma_data <- rgamma(n=1000, shape=1)
qqnorm(gamma_data, pch = 16)
qqline(gamma_data, col = 2)

The points deviate clearly from the reference line, so this sample does not look normally distributed.
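The visual impression from the QQ plots can be backed up with a number. A simple sketch, not a formal test, is to take the coordinates that qqnorm() returns and compute the correlation between the theoretical and the sample quantiles; the closer to 1, the straighter the line. The names qq_norm, qq_gamma, cor_norm and cor_gamma are my own.

```r
set.seed(123)
norm_data <- rnorm(n=1000, mean=200, sd=50)
set.seed(123)
gamma_data <- rgamma(n=1000, shape=1)

# qqnorm() returns the plotted coordinates; plot.it=FALSE skips the plot
qq_norm <- qqnorm(norm_data, plot.it=FALSE)
qq_gamma <- qqnorm(gamma_data, plot.it=FALSE)

# correlation between theoretical (x) and sample (y) quantiles
cor_norm <- cor(qq_norm$x, qq_norm$y)
cor_gamma <- cor(qq_gamma$x, qq_gamma$y)

c(normal = cor_norm, gamma = cor_gamma)
```

The correlation for norm_data should be very close to 1, and noticeably lower for gamma_data.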
The Shapiro-Wilk test can test whether a sample comes from a normally distributed population.
shapiro.test(norm_data)

    Shapiro-Wilk normality test

data:  norm_data
W = 0.9984, p-value = 0.4765

The null hypothesis is that our sample comes from a normally distributed population, and we cannot reject it (p-value = 0.4765). What about our sample generated using the Gamma distribution?
shapiro.test(gamma_data)

    Shapiro-Wilk normality test

data:  gamma_data
W = 0.8146, p-value < 2.2e-16

We can reject the null hypothesis (p-value < 2.2e-16), which is consistent with gamma_data not being normally distributed.
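When checking several samples, it can be handy to pull the p-values out of the test objects rather than reading the printed output. A minimal sketch, assuming a significance level of 0.05 that I picked purely for illustration (alpha, samples, p_values and reject are my own names):

```r
set.seed(123)
norm_data <- rnorm(n=1000, mean=200, sd=50)
set.seed(123)
gamma_data <- rgamma(n=1000, shape=1)

alpha <- 0.05  # significance level, chosen here for illustration

samples <- list(norm_data = norm_data, gamma_data = gamma_data)
p_values <- sapply(samples, function(s) shapiro.test(s)$p.value)

# TRUE means normality is rejected at the chosen level
reject <- p_values < alpha
reject
```

Note that shapiro.test() only accepts between 3 and 5000 observations, so larger datasets would need to be subsampled or checked in another way.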
Finally, let's use the two-sample Kolmogorov-Smirnov test on norm_data and gamma_data.

ks.test(norm_data, gamma_data)

    Two-sample Kolmogorov-Smirnov test

data:  norm_data and gamma_data
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
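ks.test() can also compare a single sample against a reference distribution, by passing the name of a cumulative distribution function such as "pnorm" together with its parameters. The sketch below uses the parameters that the data were generated with; for gamma_data I use mean=1 and sd=1 because a Gamma distribution with shape=1 and the default rate=1 has that mean and standard deviation. The names ks_norm and ks_gamma are my own.

```r
set.seed(123)
norm_data <- rnorm(n=1000, mean=200, sd=50)
set.seed(123)
gamma_data <- rgamma(n=1000, shape=1)

# one-sample KS test against a fully specified normal distribution
ks_norm <- ks.test(norm_data, "pnorm", mean=200, sd=50)
ks_gamma <- ks.test(gamma_data, "pnorm", mean=1, sd=1)

c(norm_data = unname(ks_norm$statistic),
  gamma_data = unname(ks_gamma$statistic))
c(norm_data = ks_norm$p.value, gamma_data = ks_gamma$p.value)
```

One caveat I'm aware of: the reference parameters should be known in advance. If the mean and standard deviation are estimated from the same sample, the p-value is no longer reliable, and a dedicated normality test such as Shapiro-Wilk is the safer choice.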
Summary
I've used some extreme examples in this post to highlight the Kolmogorov-Smirnov and Shapiro-Wilk tests. In practice, your datasets may not be as extreme.
This work is licensed under a Creative Commons Attribution 4.0 International License.
“In both examples, we cannot reject the null hypothesis that the two distributions are different.” do you mean the same? isn’t the null hypothesis that the distributions are the same?
you say lines below “In this case we get an extremely low p-value, and we can reject the null, which is that both the distributions are the same and they are not (one is a Normal distribution and the other a Poisson distribution).”
Thanks for the comment and you’re right. In light of this, I rewrote most of the post (since it was quite badly written!).
My problem is with the sample size. When comparing two gamma distributions, the p value changes with the number of samples which affects the acceptance / rejection decision. For example:
set.seed(123)
x