Comparing different distributions

Updated 2017 September 7th

The Kolmogorov-Smirnov test can be used to test whether two underlying one-dimensional probability distributions differ. As noted in the Wikipedia article:

Note that the two-sample test checks whether the two data samples come from the same distribution. This does not specify what that common distribution is (e.g. whether it's normal or not normal).

Let's generate two samples that follow the Poisson distribution with the same parameters.

# example 1
set.seed(123)
x <- rpois(n=1000, lambda=100)
set.seed(234)
y <- rpois(n=1000, lambda=100)

# perform a two-sample Kolmogorov-Smirnov test
ks.test(x,y)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.036, p-value = 0.5361
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : p-value will be approximate in the presence of ties

The null hypothesis is that both samples come from the same distribution. It is not rejected (p-value = 0.5361), which is the expected outcome because both samples were drawn from the same Poisson distribution.
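
To make the D statistic more concrete, here is a minimal sketch that recomputes it by hand as the largest vertical distance between the two empirical cumulative distribution functions (ECDFs). The names pooled, d_manual and d_ks are my own, and the samples are regenerated exactly as in example 1.

```r
set.seed(123)
x <- rpois(n = 1000, lambda = 100)
set.seed(234)
y <- rpois(n = 1000, lambda = 100)

# evaluate both ECDFs at every distinct value observed in either sample
pooled <- sort(unique(c(x, y)))
d_manual <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))

# statistic reported by ks.test(); the ties warning is suppressed here
d_ks <- unname(suppressWarnings(ks.test(x, y))$statistic)

# compare the hand-computed value with the one from ks.test()
all.equal(d_manual, d_ks)
```

If the two values agree, D is simply the biggest gap between the two ECDF curves, which is why plotting the ECDFs is a useful companion to the test.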

The warning message arises because the KS test in R computes its p-value under the assumption of a continuous distribution, in which case there should not be any identical values (i.e. ties) in the two datasets. Several sources I've read mention that the KS test can be applied to both discrete and continuous data (I'm guessing because it works with the empirical cumulative distributions), but with ties the p-value reported by R is only approximate. For more information, see this page.
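
As a quick check on why the warning appears, the sketch below counts how many distinct values the pooled Poisson samples contain. The names pooled, n_distinct and has_ties are my own and are used only for illustration.

```r
set.seed(123)
x <- rpois(n = 1000, lambda = 100)
set.seed(234)
y <- rpois(n = 1000, lambda = 100)

pooled <- c(x, y)

# number of distinct values among the 2000 pooled observations
n_distinct <- length(unique(pooled))

# TRUE if at least one value occurs more than once
has_ties <- anyDuplicated(pooled) > 0

n_distinct
has_ties
```

Since Poisson counts are integers, 2000 observations have to share a much smaller set of distinct values, so ties are unavoidable here.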

In the example below, the ties are broken by adding some random noise to the samples with the jitter() function.

# example 2

set.seed(123)
x <- rpois(n=1000, lambda=100)

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   66.0    93.0    99.5    99.5   106.0   130.0

# add some noise
set.seed(123)
x <- jitter(x, factor=100)

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  52.30   88.82   99.71   99.40  109.43  148.52

set.seed(234)
y <- rpois(n=1000, lambda=100)
set.seed(234)
y <- jitter(y, factor=100)

ks.test(x,y)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.027, p-value = 0.8593
alternative hypothesis: two-sided

plot(ecdf(x = x), main = "ECDF of x and y")
lines(ecdf(x = y), col = 2)

Despite the added noise, the two empirical distributions remain quite similar, and the null hypothesis is again not rejected (p-value = 0.8593).
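
It is worth seeing how much noise jitter() adds with factor = 100. As I read the ?jitter help page, when amount is not supplied the noise is uniform on [-amount, amount] with amount = factor/5 * d, where d is the smallest difference between adjacent distinct values. For these integer counts d should be 1, so the noise would lie within plus or minus 20, which matches the wider range in the second summary(x) above. The sketch below checks this; x_jit, max_shift and ties_after are names I made up.

```r
set.seed(123)
x <- rpois(n = 1000, lambda = 100)
set.seed(123)
x_jit <- jitter(x, factor = 100)

# largest shift applied to any single observation
max_shift <- max(abs(x_jit - x))

# TRUE if the jittered sample still contains repeated values
ties_after <- anyDuplicated(x_jit) > 0

max_shift
ties_after
```

Noise of this size is comparable to the spread of the Poisson sample itself (standard deviation of about 10), so a much smaller factor would already be enough if the only goal is to break ties.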

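Let's perform the test using two samples following different distributions: a Poisson distribution and a normal distribution with the same mean. Note that rnorm() is called without sd, so the normal sample has the default standard deviation of 1.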

set.seed(123)
x <- rpois(n=1000, lambda=100)
set.seed(123)
x <- jitter(x, factor=100)
set.seed(234)
y <- rnorm(n=1000, mean=100)
set.seed(234)
y <- jitter(y, factor=100)
ks.test(x,y)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.439, p-value < 2.2e-16
alternative hypothesis: two-sided

plot(ecdf(x = x), main = "ECDF of x and y")
lines(ecdf(x = y), col = 2)

The two samples clearly have very different distributions.

In the example above, we get an extremely low p-value (< 2.2e-16), so we can reject the null hypothesis that both samples come from the same distribution; one sample was drawn from a Poisson distribution and the other from a normal distribution.
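
Part of the reason this example is so clear-cut is the difference in spread: a Poisson distribution with lambda = 100 has a standard deviation of 10, while rnorm() defaults to sd = 1. The sketch below is my own variation, not part of the original example; it skips the jitter step and compares the Poisson sample against a normal sample with the default sd and against one with sd = 10. The names y_default, y_matched, d_default and d_matched are hypothetical.

```r
set.seed(123)
x <- rpois(n = 1000, lambda = 100)
set.seed(234)
y_default <- rnorm(n = 1000, mean = 100)           # sd = 1 (the default)
set.seed(234)
y_matched <- rnorm(n = 1000, mean = 100, sd = 10)  # sd = sqrt(lambda)

# sample standard deviations of the three samples
c(x = sd(x), y_default = sd(y_default), y_matched = sd(y_matched))

# x contains ties, so the warning about approximate p-values is suppressed
d_default <- unname(suppressWarnings(ks.test(x, y_default))$statistic)
d_matched <- unname(suppressWarnings(ks.test(x, y_matched))$statistic)

c(default = d_default, matched = d_matched)
```

When the spread is matched, the two samples are far harder to tell apart, which is a reminder that the KS test reacts to any difference between distributions (location, scale or shape), not just to the distribution family.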

Testing for normality

Many statistical tests assume that the data is normally distributed. One approach for determining normality is using a QQ-normal plot.

set.seed(123)
norm_data <- rnorm(n=1000, mean=200, sd=50)
qqnorm(norm_data, pch = 16)
qqline(norm_data, col = 2)

The points fall along the reference line, as expected for normally distributed data.

set.seed(123)
gamma_data <- rgamma(n=1000, shape=1)
qqnorm(gamma_data, pch = 16)
qqline(gamma_data, col = 2)

The points deviate substantially from the reference line, which indicates that the data are not normally distributed.
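
For anyone wondering what a QQ-normal plot actually draws, here is a rough sketch: the sorted sample is plotted against the theoretical quantiles of a standard normal distribution. The correlation between the two is one simple (and informal) way to put a number on how straight the plot is. The names theoretical, same_x, cor_norm and cor_gamma are mine.

```r
set.seed(123)
norm_data <- rnorm(n = 1000, mean = 200, sd = 50)
set.seed(123)
gamma_data <- rgamma(n = 1000, shape = 1)

# theoretical standard normal quantiles for a sample of this size
theoretical <- qnorm(ppoints(length(norm_data)))

# qqnorm() returns the plotted coordinates when plot.it = FALSE
qq <- qqnorm(norm_data, plot.it = FALSE)
same_x <- isTRUE(all.equal(sort(qq$x), theoretical))

# correlation between theoretical and observed (sorted) quantiles
cor_norm <- cor(theoretical, sort(norm_data))
cor_gamma <- cor(theoretical, sort(gamma_data))

same_x
c(normal = cor_norm, gamma = cor_gamma)
```

A correlation very close to 1 corresponds to points hugging the line; the Gamma sample should score noticeably lower.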

The Shapiro-Wilk test can test whether a sample comes from a normally distributed population.

shapiro.test(norm_data)

	Shapiro-Wilk normality test

data:  norm_data
W = 0.9984, p-value = 0.4765

The null hypothesis is that our sample comes from a normally distributed population, and we cannot reject it (p-value = 0.4765). What about our sample generated using the Gamma distribution?

shapiro.test(gamma_data)

	Shapiro-Wilk normality test

data:  gamma_data
W = 0.8146, p-value < 2.2e-16

We can reject the null hypothesis (p-value < 2.2e-16), which is the expected result because gamma_data was drawn from a Gamma distribution and not from a normal distribution.
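
If you want to use the test result in a script instead of reading it off the printed output, the p-value can be pulled out of the object returned by shapiro.test(). The helper rejects_normality() below is hypothetical (my own name, with an assumed significance level of 0.05). Keep in mind that shapiro.test() only accepts between 3 and 5000 observations.

```r
# hypothetical helper: TRUE when normality is rejected at level alpha
rejects_normality <- function(values, alpha = 0.05) {
  shapiro.test(values)$p.value < alpha
}

set.seed(123)
gamma_data <- rgamma(n = 1000, shape = 1)

gamma_rejected <- rejects_normality(gamma_data)
gamma_rejected
```

The same call on norm_data should give the opposite answer if the p-value reported earlier in this post is reproduced.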

Finally, let's use the two-sample Kolmogorov-Smirnov test on norm_data and gamma_data.

ks.test(norm_data, gamma_data)

	Two-sample Kolmogorov-Smirnov test

data:  norm_data and gamma_data
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
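
A statistic of D = 1 is the largest possible value and means that the two ECDFs are completely separated, i.e. every value in one sample is smaller than every value in the other. The sketch below checks this and, as my own addition, repeats the test after standardising both samples with scale() so that only the difference in shape is left. The names no_overlap, norm_std, gamma_std and d_std are hypothetical.

```r
set.seed(123)
norm_data <- rnorm(n = 1000, mean = 200, sd = 50)
set.seed(123)
gamma_data <- rgamma(n = 1000, shape = 1)

# TRUE if every Gamma value lies below every normal value
no_overlap <- max(gamma_data) < min(norm_data)

# standardise both samples (mean 0, sd 1) to remove location and scale
norm_std <- as.numeric(scale(norm_data))
gamma_std <- as.numeric(scale(gamma_data))

# KS statistic that reflects only the remaining difference in shape
d_std <- unname(ks.test(norm_std, gamma_std)$statistic)

no_overlap
d_std
```

In other words, the D = 1 above is driven by the difference in location between the two samples, not only by the fact that one is normal and the other is Gamma.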

Summary

I've used some extreme examples in this post to highlight the Kolmogorov-Smirnov and Shapiro-Wilk tests. In practice, your datasets may not be as extreme.

This work is licensed under a Creative Commons Attribution 4.0 International License.
4 comments
  1. “In both examples, we cannot reject the null hypothesis that the two distributions are different.” do you mean the same? isn’t the null hypothesis that the distributions are the same?
    you say lines below “In this case we get an extremely low p-value, and we can reject the null, which is that both the distributions are the same and they are not (one is a Normal distribution and the other a Poisson distribution).”

    1. Thanks for the comment and you’re right. In light of this, I rewrote most of the post (since it was quite badly written!).

  2. My problem is with the sample size. When comparing two gamma distributions, the p value changes with the number of samples which affects the acceptance / rejection decision. For example:
    set.seed(123)
    x
