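Given the contingency table of two clusterings $X = \{X_1, \dots, X_r\}$ and $Y = \{Y_1, \dots, Y_s\}$, where $n_{ij}$ is the number of elements that fall in both $X_i$ and $Y_j$, $a_i = \sum_j n_{ij}$ is the sum of row $i$, and $b_j = \sum_i n_{ij}$ is the sum of column $j$, the adjusted index is:

$$ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}$$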
As per usual, it’ll be easier to understand with an example. I’ll use R to create two random sets of elements, which represent clustering results.
    set.seed(1)
    x <- sample(x = rep(1:3, 4), 12)
    x
    # [1] 1 2 3 3 2 1 1 3 3 1 2 2

    set.seed(2)
    y <- sample(x = rep(1:3, 4), 12)
    y
    # [1] 3 2 3 2 2 1 1 2 3 1 3 1
In this example there are three clusters in both sets, so our contingency table will have three rows and three columns. We just need to count the co-occurrences to build the contingency table. $n_{11}$ would be the number of times an element occurs in cluster 1 of X and cluster 1 of Y; this occurs three times: the sixth, seventh, and tenth elements. Here’s the full contingency table:

           Y1  Y2  Y3  Sum
    X1      3   0   1    4
    X2      1   2   1    4
    X3      0   2   2    4
    Sum     4   4   4   12
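The counting can also be left to R. This is a minimal sketch that assumes the two vectors printed earlier; they are typed in by hand here so the result doesn't depend on which random number generator your R version uses, and the name `tab` is my own.

```r
# Cluster labels copied from the sample() output shown in the post
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

# Cross-tabulate: rows are clusters in x, columns are clusters in y
tab <- table(x, y)

# Print the table with the row and column sums appended
addmargins(tab)
```

`table()` does the co-occurrence counting and `addmargins()` appends the row and column sums that the ARI formula needs.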
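If you look closely at the ARI formula, there are really just three different parts:

$$\sum_{ij} \binom{n_{ij}}{2}, \qquad \sum_i \binom{a_i}{2}, \qquad \sum_j \binom{b_j}{2}$$

$\sum$ means the sum, $i$ refers to the row number, $j$ refers to the column number, $a_i$ refers to the row sum, and $b_j$ refers to the column sum. Now let’s work out each part.

For the first part, cells with a count of 0 or 1 contribute nothing, which leaves one cell with a count of 3 and three cells with a count of 2:

$$\sum_{ij} \binom{n_{ij}}{2} = \binom{3}{2} + \binom{2}{2} + \binom{2}{2} + \binom{2}{2} = 3 + 1 + 1 + 1 = 6$$

Every row sum and every column sum is 4, so the other two parts are the same:

$$\sum_i \binom{a_i}{2} = 3 \binom{4}{2} = 18, \qquad \sum_j \binom{b_j}{2} = 3 \binom{4}{2} = 18$$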
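Substituting the values into the ARI formula, with $\binom{n}{2} = \binom{12}{2} = 66$, we get:

$$ARI = \frac{6 - (18 \times 18) / 66}{\frac{1}{2}(18 + 18) - (18 \times 18) / 66} = \frac{6 - 4.909}{18 - 4.909} \approx 0.0833$$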
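If you’d rather not do the arithmetic by hand, the same steps can be sketched in base R. This assumes the two label vectors from the example (typed in directly rather than re-sampled), and all the variable names are my own rather than anything from a package.

```r
# Cluster labels copied from the sample() output shown in the post
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

tab <- table(x, y)

# The three parts of the formula
sum_nij <- sum(choose(tab, 2))           # pairs together in both clusterings
sum_a   <- sum(choose(rowSums(tab), 2))  # pairs together in x
sum_b   <- sum(choose(colSums(tab), 2))  # pairs together in y

# Expected value under chance and the maximum of the index
expected  <- sum_a * sum_b / choose(length(x), 2)
max_index <- (sum_a + sum_b) / 2

ari <- (sum_nij - expected) / (max_index - expected)
ari
# [1] 0.08333333
```

`choose()` is vectorised, so it can be applied to the whole table and to the vectors of row and column sums in one go.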
The clues package contains the adjustedRand() function that can calculate the Rand index and the ARI.
    # install if you haven't already
    install.packages("clues")

    # load package
    library(clues)

    set.seed(1)
    x <- sample(x = rep(1:3, 4), 12)
    set.seed(2)
    y <- sample(x = rep(1:3, 4), 12)

    adjustedRand(x, y)
    #       Rand         HA         MA         FM    Jaccard
    # 0.63636364 0.08333333 0.25000000 0.33333333 0.20000000
The adjustedRand() function calculates:
the five agreement indices: Rand index, Hubert and Arabie’s adjusted Rand index, Morey and Agresti’s adjusted Rand index, Fowlkes and Mallows’s index, and Jaccard index, which measure the agreement between any two partitions for a data set.
The HA value is the one that matches the formula shown on Wikipedia, so that formula must be Hubert and Arabie’s adjusted Rand index. Notice how different the Rand index (0.636) is from the ARI (0.083), which makes sense: the example data used in this post is small, so a lot of the overlap is just due to chance.
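To see where the unadjusted Rand index comes from, you can count pairs directly. This is a rough sketch in base R, assuming the usual pair-counting definition of the Rand index (the fraction of element pairs on which the two clusterings agree); the variable names are mine.

```r
# Cluster labels copied from the sample() output shown in the post
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

# All pairs of elements, one pair per column
pairs <- combn(length(x), 2)

# Is each pair placed in the same cluster by x? By y?
same_x <- x[pairs[1, ]] == x[pairs[2, ]]
same_y <- y[pairs[1, ]] == y[pairs[2, ]]

# The clusterings agree on a pair if both put it together or both split it
agreements <- sum(same_x == same_y)

rand <- agreements / ncol(pairs)
rand
# [1] 0.6363636
```

Most of the agreements are pairs that both clusterings split apart, which is easy to get by chance with three clusters; that is the part the adjustment corrects for.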
This work is licensed under a Creative Commons
Attribution 4.0 International License.