In my last post, I wrote about the Rand index. This post will be on the Adjusted Rand index (ARI), which is the corrected-for-chance version of the Rand index:

$AdjustedIndex = \frac{Index - ExpectedIndex}{MaxIndex - ExpectedIndex}$

Given the contingency table:

| | $Y_1$ | $Y_2$ | $\cdots$ | $Y_s$ | Sums |
|---|---|---|---|---|---|
| $X_1$ | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1s}$ | $a_1$ |
| $X_2$ | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2s}$ | $a_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $X_r$ | $n_{r1}$ | $n_{r2}$ | $\cdots$ | $n_{rs}$ | $a_r$ |
| Sums | $b_1$ | $b_2$ | $\cdots$ | $b_s$ | |

$ARI = \frac{ \sum_{ij} { {n_{ij}}\choose{2} } - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } } { \frac{1}{2} [ \sum_{i} { a_{i}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ] - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } }$
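To connect this with the general form at the top of the post, one way to read the formula is term by term, where $n$ is the total number of elements, $a_i$ are the row sums, and $b_j$ are the column sums:

$Index = \sum_{ij} { {n_{ij}}\choose{2} }$

$ExpectedIndex = [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} }$

$MaxIndex = \frac{1}{2} [ \sum_{i} { {a_{i}}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ]$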

As per usual, it'll be easier to understand with an example. I'll use R to create two random sets of elements, which represent clustering results.

set.seed(1)
x <- sample(x = rep(1:3, 4), 12)
x
[1] 1 2 3 3 2 1 1 3 3 1 2 2

set.seed(2)
y <- sample(x = rep(1:3, 4), 12)
y
[1] 3 2 3 2 2 1 1 2 3 1 3 1


In this example there are 3 clusters in both sets, so our contingency table will have three rows and three columns. We just need to count the co-occurrences to build the contingency table. $n_{11}$ would be the number of times an element occurs in cluster 1 of X and cluster 1 of Y; this occurs three times: the sixth, seventh, and tenth elements. Here's the full contingency table:

| | $Y_1$ | $Y_2$ | $Y_3$ | Row sums |
|---|---|---|---|---|
| $X_1$ | 3 | 0 | 1 | 4 |
| $X_2$ | 1 | 2 | 1 | 4 |
| $X_3$ | 0 | 2 | 2 | 4 |
| Column sums | 4 | 4 | 4 | |
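The counting can also be left to R. Here is a minimal sketch using base R's `table()` and `addmargins()`; the label vectors are typed in from the output printed earlier rather than regenerated with `sample()`, since the random draws can differ between R versions. The name `tab` is just my choice for this illustration.

```r
# Cluster labels copied from the printed output earlier in the post
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

# Cross-tabulate the labels: rows are clusters of X, columns are clusters of Y
tab <- table(X = x, Y = y)

# Print the table with row and column sums appended
addmargins(tab)
```

The counts should agree with the hand-built table, including $n_{11} = 3$.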

If you look closely at the ARI formula, there are really just three different parts:

1. $\sum_{ij} { {n_{ij}}\choose{2} }$
2. $\sum_{i} { {a_{i}}\choose{2} }$
3. $\sum_{j} { {b_{j}}\choose{2} }$

$\sum$ means the sum, $i$ refers to the row number, $j$ refers to the column number, $a$ refers to the row sum, $b$ refers to the column sum, and $n$ is the total number of elements (12 in this example). Now let's work out each part.

1. $\sum_{ij} { {n_{ij}}\choose{2} } = { {3}\choose{2} } + { {0}\choose{2} } + { {1}\choose{2} } + { {1}\choose{2} } + { {2}\choose{2} } + { {1}\choose{2} } + { {0}\choose{2} } + { {2}\choose{2} } + { {2}\choose{2} } = 3 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 6$
2. $\sum_{i} { {a_{i}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18$
3. $\sum_{j} { {b_{j}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18$
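The three sums above can be checked with base R's `choose()`, which is vectorised. This is only a sketch: the variable names `part1`, `part2`, and `part3` are mine, and the label vectors are again typed in from the printed output.

```r
# Cluster labels copied from the printed output earlier in the post
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)
tab <- table(x, y)

# Part 1: choose(n_ij, 2) summed over every cell of the contingency table
part1 <- sum(choose(as.vector(tab), 2))

# Part 2: choose(a_i, 2) summed over the row sums
part2 <- sum(choose(rowSums(tab), 2))

# Part 3: choose(b_j, 2) summed over the column sums
part3 <- sum(choose(colSums(tab), 2))

c(part1, part2, part3)
```

Note that `choose(0, 2)` and `choose(1, 2)` are both 0, which is why only cells with at least two elements contribute to the first sum.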

Substituting the values into the ARI formula we get:

$ARI = \frac{6 - [18 \times 18] / { {12}\choose{2} } } {\frac{1}{2} [18+18] - [18 \times 18] / { {12}\choose{2} }} = \frac{6 - 4.909091}{18 - 4.909091} = 0.08333333$
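The substitution above can be wrapped into a small function. The sketch below is my own illustration of the formula, not the implementation used by any package; `adjusted_rand` is a hypothetical name, and the function assumes two label vectors of equal length.

```r
# Hypothetical helper: ARI computed directly from the contingency-table formula
adjusted_rand <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)
  n <- length(x)

  sum_nij <- sum(choose(as.vector(tab), 2))  # pairs that share a cell
  sum_a   <- sum(choose(rowSums(tab), 2))    # pairs that share a row
  sum_b   <- sum(choose(colSums(tab), 2))    # pairs that share a column

  expected <- sum_a * sum_b / choose(n, 2)   # ExpectedIndex
  maximum  <- (sum_a + sum_b) / 2            # MaxIndex

  # Not handled here: the degenerate case where maximum equals expected
  # (for example, both clusterings put everything in one cluster)
  (sum_nij - expected) / (maximum - expected)
}

# Cluster labels copied from the printed output earlier in the post
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

adjusted_rand(x, y)  # 0.08333333, which is exactly 1/12
adjusted_rand(x, x)  # 1, since a clustering agrees perfectly with itself
```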

## Using R

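The clues package contains the adjustedRand() function, which can calculate the Rand index and the ARI. In its output, the HA value (Hubert and Arabie's adjusted Rand index) is the ARI, and it matches the 0.08333333 we calculated by hand.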

# install if you haven't already
install.packages("clues")
library(clues)

set.seed(1)
x <- sample(x = rep(1:3, 4), 12)
set.seed(2)
y <- sample(x = rep(1:3, 4), 12)

adjustedRand(x, y)
      Rand         HA         MA         FM    Jaccard
0.63636364 0.08333333 0.25000000 0.33333333 0.20000000
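As a cross-check that does not need the package, the Rand, FM (Fowlkes and Mallows), and Jaccard values can be reproduced by counting pairs of elements directly. This is a sketch in base R under my own variable names (`both`, `only_x`, `only_y`, `neither`), using the label vectors typed in from the printed output.

```r
# Cluster labels copied from the printed output earlier in the post
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

# For every pair of elements: are they in the same cluster in X? In Y?
same_x <- outer(x, x, "==")
same_y <- outer(y, y, "==")

# Keep each unordered pair once by using the upper triangle only
pairs <- upper.tri(same_x)
sx <- same_x[pairs]
sy <- same_y[pairs]

both    <- sum(sx & sy)     # together in X and in Y
only_x  <- sum(sx & !sy)    # together in X only
only_y  <- sum(!sx & sy)    # together in Y only
neither <- sum(!sx & !sy)   # separated in both

rand    <- (both + neither) / choose(length(x), 2)
jaccard <- both / (both + only_x + only_y)
fm      <- both / sqrt((both + only_x) * (both + only_y))

c(Rand = rand, FM = fm, Jaccard = jaccard)
```

For these labels the pair counts are 6, 12, 12, and 36, which give the same Rand, FM, and Jaccard values as the output above.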