In my last post, I wrote about the Rand index. This post will be on the Adjusted Rand index (ARI), which is the corrected-for-chance version of the Rand index:

$AdjustedIndex = \frac{Index - ExpectedIndex}{MaxIndex - ExpectedIndex}$

Given the contingency table:

|  | $Y_1$ | $Y_2$ | $\cdots$ | $Y_s$ | Sums |
|---|---|---|---|---|---|
| $X_1$ | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1s}$ | $a_1$ |
| $X_2$ | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2s}$ | $a_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $X_r$ | $n_{r1}$ | $n_{r2}$ | $\cdots$ | $n_{rs}$ | $a_r$ |
| Sums | $b_1$ | $b_2$ | $\cdots$ | $b_s$ |  |

$ARI = \frac{ \sum_{ij} { {n_{ij}}\choose{2} } - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } } { \frac{1}{2} [ \sum_{i} { a_{i}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ] - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } }$

As per usual, it’ll be easier to understand with an example. I’ll use R to create two random sets of elements, which represent clustering results.

set.seed(1)
x <- sample(x = rep(1:3, 4), 12)
x
[1] 1 2 3 3 2 1 1 3 3 1 2 2

set.seed(2)
y <- sample(x = rep(1:3, 4), 12)
y
[1] 3 2 3 2 2 1 1 2 3 1 3 1


In this example there are 3 clusters in both sets, so our contingency table will have three rows and three columns. We just need to count the co-occurrences to build the contingency table. $n_{11}$ would be the number of times an element occurs in cluster 1 of X and cluster 1 of Y; this occurs three times: the sixth, seventh, and tenth elements. Here’s the full contingency table:

|  | $Y_1$ | $Y_2$ | $Y_3$ | Row sums |
|---|---|---|---|---|
| $X_1$ | $3$ | $0$ | $1$ | $4$ |
| $X_2$ | $1$ | $2$ | $1$ | $4$ |
| $X_3$ | $0$ | $2$ | $2$ | $4$ |
| Column sums | $4$ | $4$ | $4$ |  |
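The table above can also be built directly in R. This is a minimal sketch in which the two label vectors are typed in by hand from the printed output above, so it does not depend on `sample()`, whose result for a given seed can differ between R versions. The name `tab` is just an illustrative choice.

```r
# Cluster labels copied from the printed output above
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

# Cross-tabulate the labels: rows are clusters of X, columns are clusters of Y
tab <- table(x, y)
tab

# Row sums (a_i) and column sums (b_j)
rowSums(tab)
colSums(tab)

# Positions of the elements that are in cluster 1 of X and cluster 1 of Y (n_11)
which(x == 1 & y == 1)
```

The last line should list the sixth, seventh, and tenth positions, matching the count of three in the top-left cell.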

If you look closely at the ARI formula, there are really just three different parts:

1. $\sum_{ij} { {n_{ij}}\choose{2} }$
2. $\sum_{i} { {a_{i}}\choose{2} }$
3. $\sum_{j} { {b_{j}}\choose{2} }$

$\sum$ means the sum, $i$ refers to the row number, $j$ refers to the column number, $a$ refers to the row sum, $b$ refers to the column sum, and $n$ is the total number of elements (12 in this example). Now let’s work out each part.

1. $\sum_{ij} { {n_{ij}}\choose{2} } = { {3}\choose{2} } + { {0}\choose{2} } + { {1}\choose{2} } + { {1}\choose{2} } + { {2}\choose{2} } + { {1}\choose{2} } + { {0}\choose{2} } + { {2}\choose{2} } + { {2}\choose{2} } = 3 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 6$
2. $\sum_{i} { {a_{i}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18$
3. $\sum_{j} { {b_{j}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18$
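The three sums can be checked with R's built-in `choose()` function. This is a small sketch that types the contingency table in by hand; the variable names (`sum_nij`, `sum_ai`, `sum_bj`) are my own labels for the three parts.

```r
# Contingency table from the example, entered row by row
tab <- matrix(c(3, 0, 1,
                1, 2, 1,
                0, 2, 2), nrow = 3, byrow = TRUE)

# Part 1: sum of choose(n_ij, 2) over every cell
sum_nij <- sum(choose(tab, 2))

# Part 2: sum of choose(a_i, 2) over the row sums
sum_ai <- sum(choose(rowSums(tab), 2))

# Part 3: sum of choose(b_j, 2) over the column sums
sum_bj <- sum(choose(colSums(tab), 2))

c(sum_nij, sum_ai, sum_bj)
```

Note that `choose(0, 2)` and `choose(1, 2)` are both 0, which is why cells with fewer than two elements contribute nothing.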

Substituting the values into the ARI formula we get:

$ARI = \frac{6 - [18 \times 18] / { {12}\choose{2} } } {\frac{1}{2} [18+18] - [18 \times 18] / { {12}\choose{2} }} = \frac{6 - 4.909091}{18 - 4.909091} = 0.08333333$
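The substitution above can be wrapped into a short function. This is a sketch only, not the implementation used by any package; `ari()` is a hypothetical name, and it assumes both inputs are label vectors of the same length.

```r
# Hypothetical helper: ARI computed from two vectors of cluster labels
ari <- function(x, y) {
  tab <- table(x, y)
  n <- length(x)

  # The three parts of the formula
  sum_nij <- sum(choose(as.vector(tab), 2))
  sum_ai <- sum(choose(rowSums(tab), 2))
  sum_bj <- sum(choose(colSums(tab), 2))

  # Expected index under chance, and the maximum index
  expected <- sum_ai * sum_bj / choose(n, 2)
  maximum <- (sum_ai + sum_bj) / 2

  (sum_nij - expected) / (maximum - expected)
}

x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)
ari(x, y)
```

For the example data this works out to 1/12, which is the 0.08333333 obtained by hand. The sketch does not guard against a zero denominator (for instance, when both clusterings put everything in one cluster), in which case it returns `NaN`.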

## Using R

The clues package contains the adjustedRand() function that can calculate the Rand index and the ARI.

# install if you haven't already
install.packages("clues")
library(clues)

set.seed(1)
x <- sample(x = rep(1:3, 4), 12)
set.seed(2)
y <- sample(x = rep(1:3, 4), 12)

adjustedRand(x, y)
Rand         HA         MA         FM    Jaccard
0.63636364 0.08333333 0.25000000 0.33333333 0.20000000


The output contains the five agreement indices: Rand index, Hubert and Arabie’s adjusted Rand index, Morey and Agresti’s adjusted Rand index, Fowlkes and Mallows’s index, and Jaccard index, which measure the agreement between any two partitions for a data set.

The HA value matches the manual calculation above, so the formula shown on Wikipedia must be Hubert and Arabie’s adjusted Rand index. Notice how different the Rand index (0.636) is from the ARI (0.083); this makes sense because the example data set is small, so a lot of the agreement between the two clusterings arises just by chance.
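To get a feel for how much agreement arises by chance, one option is to shuffle one set of labels many times and average both indices. The sketch below is my own illustration under stated assumptions: `rand_and_ari()` is a hypothetical helper, the Rand index is computed with the standard contingency-table identity (it reproduces the 0.636 above), and the seed and the 2000 repetitions are arbitrary choices.

```r
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

# Hypothetical helper: Rand index and ARI from two label vectors
rand_and_ari <- function(x, y) {
  tab <- table(x, y)
  total <- choose(length(x), 2)  # number of pairs of elements

  s_nij <- sum(choose(as.vector(tab), 2))
  s_a <- sum(choose(rowSums(tab), 2))
  s_b <- sum(choose(colSums(tab), 2))

  # Rand index: fraction of pairs on which the two clusterings agree
  rand <- (total + 2 * s_nij - s_a - s_b) / total

  # ARI: same substitution as in the formula above
  expected <- s_a * s_b / total
  ari <- (s_nij - expected) / ((s_a + s_b) / 2 - expected)

  c(Rand = rand, ARI = ari)
}

# Indices for the example data
rand_and_ari(x, y)

# Shuffle y repeatedly so that any agreement with x is due to chance alone
set.seed(42)
sims <- replicate(2000, rand_and_ari(x, sample(y)))
rowMeans(sims)
```

With shuffled labels, the average Rand index should stay well above zero while the average ARI should sit close to zero, which is the point of the correction for chance.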

1. Jon says:

Hi Dave

First of all, nice post on the difference between the Rand index and the adjusted Rand index.

I have been staring at the contingency table and trying to make sense of this sentence:

Element occurs in cluster 1 of X and cluster 1 of Y; this occurs three times: the sixth, seventh, and tenth elements

Can you elaborate on what you mean by the sixth, seventh, and tenth elements, and how you define an element?
best

Jon

1. Davo says:

Sorry, let me try to explain it again.

We have a list of 12 items and I refer to each as an element. x and y contain the clustering results of these 12 items. In x, four elements belong to cluster one and in y, four elements belong to cluster one.

If you examine the clustering results, you’ll see that the sixth, seventh, and tenth elements are in cluster one for both clustering results.

2. Andrew says:

Thank you, Thank you, Thank you.
I have been getting bogged down trying to get my head around this (like so many, my mind seems to slow down when hit with densely packed mathematical symbols that need translation) and I just needed to see it worked through. This blog did just that beautifully. Well done!

3. Mamod says:

Yello Dave!

Thank you for an interesting post, very educational and easy to follow. I’m having a problem with the following statement:

“n_11 would be the number of times an element occurs in cluster 1 of X and cluster 1 of Y”

How do we know that cluster 1 of X corresponds to cluster 1 of Y? It is easy to think of a setting in which an original cluster A is split into two clusters, B and C. How does that translate into this setting?

In the case of the Rand Index we looked at two internal points a & b, and how they were clustered by two clustering methods A and B. Are we not doing the same thing here?

Best regards,
Mamod

4. David says:

Hi! ARI has a value range of [-1,1], where 1 stands for complete agreement between partitions and 0 means both partitions are random. But what does a negative value stand for? I mean, what does a value of -1 or -0.33 mean?