I've been looking for ways to compare clustering results and through my searching I came across something called the Rand index. In this short post, I explain how this index is calculated.

From the Wikipedia page you can see that the Rand index, R, is calculated by:

Ignoring the numerator for now, notice that the denominator is a binomial coefficient. is the number of unordered pairs in a set of n elements. For example, if you have set of 4 elements {a, b, c, d}, there are 6 unordered pairs: {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, and {c, d}. The in the formula refers to the number of times a pair of elements belongs to a same cluster across two different clustering results and the refers to the number of times a pair of elements are in different clusters across two different clustering results. It will be easier to understand the Rand index with a simple example.

Say we have a set of six elements: {a, b, c, d, e, f}. Clustering method 1 (CM1) forms three clusters; the first two items are in group 1, the third and fourth are in group 2, and the fifth and sixth are in group 3: {1, 1, 2, 2, 3, 3}. Clustering method 2 (CM2) forms two clusters; the first three items are in group 1 and the last three items are in group 2: {1, 1, 1, 2, 2, 2}. What's the Rand index of these two clustering results?

To manually calculate the Rand index, we need to go through every unordered pair to work out and ; let's work out first. There are 15 unordered pairs in a set of six elements: {a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {b, c}, {b, d}, {b, e}, {b, f}, {c, d}, {c, e}, {c, f}, {d, e}, {d, f}, and {e, f}. is every time a pair of elements is grouped together by the two clustering methods. {a, b} and {e, f} are clustered together by CM1 and CM2, so = 2. is every time a pair of elements is not grouped together by the two clustering methods. {a, d}, {a, e}, {a, f}, {b, d}, {b, e}, {b, f}, {c, e}, and {c, f} are not clustered together by CM1 and CM2, so = 8. and are the times that both clustering methods agree. Therefore, the Rand index is:

Using the fossil package and the rand.index() function, we can work out the Rand index in R.

# CM1 a <- rep(1:3, each = 2) a [1] 1 1 2 2 3 3 # CM2 b <- rep(1:2, each = 3) b [1] 1 1 1 2 2 2 # install package if you haven't already install.packages("fossil") # load package library(fossil) rand.index(a, b) [1] 0.6666667

Let's take a look at the rand.index() function.

rand.index function (group1, group2) { x <- abs(sapply(group1, function(x) x - group1)) x[x > 1] <- 1 y <- abs(sapply(group2, function(x) x - group2)) y[y > 1] <- 1 sg <- sum(abs(x - y))/2 bc <- choose(dim(x)[1], 2) ri <- 1 - sg/bc return(ri) } <environment: namespace:fossil>

Let's go through the code step by step to see what it is doing.

group1 <- rep(1:3, 2) group1 [1] 1 2 3 1 2 3 group2 <- rep(1:3, each = 2) group2 [1] 1 1 2 2 3 3 x <- abs(sapply(group1, function(x) x - group1)) x [,1] [,2] [,3] [,4] [,5] [,6] [1,] 0 1 2 0 1 2 [2,] 1 0 1 1 0 1 [3,] 2 1 0 2 1 0 [4,] 0 1 2 0 1 2 [5,] 1 0 1 1 0 1 [6,] 2 1 0 2 1 0 x[x > 1] <- 1 x [,1] [,2] [,3] [,4] [,5] [,6] [1,] 0 1 1 0 1 1 [2,] 1 0 1 1 0 1 [3,] 1 1 0 1 1 0 [4,] 0 1 1 0 1 1 [5,] 1 0 1 1 0 1 [6,] 1 1 0 1 1 0 y <- abs(sapply(group2, function(x) x - group2)) y [,1] [,2] [,3] [,4] [,5] [,6] [1,] 0 0 1 1 2 2 [2,] 0 0 1 1 2 2 [3,] 1 1 0 0 1 1 [4,] 1 1 0 0 1 1 [5,] 2 2 1 1 0 0 [6,] 2 2 1 1 0 0 y[y > 1] <- 1 y [,1] [,2] [,3] [,4] [,5] [,6] [1,] 0 0 1 1 1 1 [2,] 0 0 1 1 1 1 [3,] 1 1 0 0 1 1 [4,] 1 1 0 0 1 1 [5,] 1 1 1 1 0 0 [6,] 1 1 1 1 0 0

Each row and column number in x and y correspond to the element number in group1 and group2, respectively. A zero indicates a match, i.e. that a particular element is in the same cluster as another element, and a one indicates that they belong to different clusters. The x and y matrices therefore list all the pairs (twice) and whether or not they belong to the same cluster.

Now the function can compare the two different groups of clusters to get the number of disagreements. Finally to get the Rand Index, which reflects the number of agreements, we subtract the fraction of disagreements from 1.

# divide by two because the pairs are counted twice # sum to get all the disagreements sg <- sum(abs(x - y))/2 sg [1] 6 # get the total number of pairs bc <- choose(dim(x)[1], 2) bc [1] 15 ri <- 1 - sg/bc ri [1] 0.6

## Assessing clustering results

We can use the Rand index to assess a clustering approach. I'll illustrate using *k*-means clustering using the famous iris dataset.

# check out the data head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa # true labels true_label <- as.numeric(iris$Species) true_label [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 [64] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 [127] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 # perform k-means clustering my_kmeans <- kmeans(x = iris[,-5], centers = 3) # clustering results my_kmeans$cluster [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 [64] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 [127] 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2 2 3 # Rand index rand.index(true_label, my_kmeans$cluster) [1] 0.8797315

From the Wikipedia article:

The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.

The Rand index suggests that the *k*-means clustering of the iris data using sepal and petal measurements is similar to the real "clustering" of the data.

This work is licensed under a Creative Commons

Attribution 4.0 International License.

Great explanation of the process, I struggled to find a clear simple explanation for this process and you provided exactly what I needed (while MANY other sites did not). Many thanks ?

Really nice article!

But I wonder why others use F-score measures (TP,TN,FP,FN) like here https://stackoverflow.com/questions/49586742/rand-index-function-clustering-performance-evaluation and here https://stats.stackexchange.com/questions/89030/rand-index-calculation ??

Also, but this is a minor note, I don’t get why articles take for granted “true labels” where actually this is just to measure agreement of two clusters, not necessarily any of them should be the ground truth.

I agree with you on both points. This index is used for comparing clustering results and ground truth data is not always available, which is why people perform clustering in the first place.

Thank you so much! I am taking an online class that just linked to wikipedia rather than explain the algorithm/formula. This was so easy to understand!

Brilliant – thank you for your posts!

tnx a million

really helpful because of the example mentioned