The Adjusted Rand index

In my last post, I wrote about the Rand index. This post covers the Adjusted Rand index (ARI), which is the corrected-for-chance version of the Rand index. The correction has the general form:

AdjustedIndex = \frac{Index - ExpectedIndex}{MaxIndex - ExpectedIndex}

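Given the contingency table for two clusterings X (with r clusters) and Y (with s clusters) of the same n elements, where n_{ij} is the number of elements in both cluster X_i and cluster Y_j, a_i is the sum of row i, and b_j is the sum of column j: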

Y_1 Y_2 \cdots Y_s Sums
X_1 n_{11} n_{12} \cdots n_{1s} a_1
X_2 n_{21} n_{22} \cdots n_{2s} a_2
\vdots \vdots \vdots \ddots \vdots \vdots
X_r n_{r1} n_{r2} \cdots n_{rs} a_r
Sums b_1 b_2 \cdots b_s

the adjusted index is:

ARI = \frac{ \sum_{ij} { {n_{ij}}\choose{2} } - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } } { \frac{1}{2} [ \sum_{i} { {a_{i}}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ] - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } }
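
Matching this term by term against the general form at the top of the post, the three pieces can be read off as follows (this is my reading of the formula, written in the same notation):

Index = \sum_{ij} { {n_{ij}}\choose{2} }

ExpectedIndex = [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} }

MaxIndex = \frac{1}{2} [ \sum_{i} { {a_{i}}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ]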

As per usual, it'll be easier to understand with an example. I'll use R to create two random vectors of cluster labels, each representing a clustering result for the same 12 elements.

set.seed(1)
x <- sample(x = rep(1:3, 4), 12)
x
 [1] 1 2 3 3 2 1 1 3 3 1 2 2

set.seed(2)
y <- sample(x = rep(1:3, 4), 12)
y
 [1] 3 2 3 2 2 1 1 2 3 1 3 1

In this example there are 3 clusters in both sets, so our contingency table will have three rows and three columns. We just need to count the co-occurrences to build the contingency table. n_{11} would be the number of times an element occurs in cluster 1 of X and cluster 1 of Y; this occurs three times: the sixth, seventh, and tenth elements. Here's the full contingency table:

Y_1 Y_2 Y_3 Row sums
X_1 3 0 1 4
X_2 1 2 1 4
X_3 0 2 2 4
Column sums 4 4 4
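
Rather than counting by hand, the table can also be built in R. This is a minimal sketch that hard-codes the two label vectors printed above instead of regenerating them with sample(), since the random draws can differ between R versions; the name tab is just my choice.

```r
# Cluster labels copied from the printed output above
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

# Cross-tabulate the labels: rows are clusters of x, columns are clusters of y
tab <- table(x, y)

# Print the table with row and column sums appended as a "Sum" margin
addmargins(tab)
```

The counts and margins printed by addmargins() should agree with the table above.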

If you look closely at the ARI formula, there are really just three different parts:

  1. \sum_{ij} { {n_{ij}}\choose{2} }
  2. \sum_{i} { {a_{i}}\choose{2} }
  3. \sum_{j} { {b_{j}}\choose{2} }

\sum means the sum, i refers to the row number, j refers to the column number, a refers to the row sum, and b refers to the column sum. Now let's work out each part.

  1. \sum_{ij} { {n_{ij}}\choose{2} } = { {3}\choose{2} } + { {0}\choose{2} } + { {1}\choose{2} } + { {1}\choose{2} } + { {2}\choose{2} } + { {1}\choose{2} } + { {0}\choose{2} } + { {2}\choose{2} } + { {2}\choose{2} } = 3 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 6
  2. \sum_{i} { {a_{i}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18
  3. \sum_{j} { {b_{j}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18
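
The same three sums can be checked in R with choose(), which is vectorised and returns 0 when a count is smaller than 2. As before, this is only a sketch: the label vectors are hard-coded from the output above, and the variable names (sum_nij, sum_a, sum_b) are my own.

```r
# Cluster labels copied from the printed output above
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)
tab <- table(x, y)

# Part 1: choose(n_ij, 2) summed over every cell of the table
sum_nij <- sum(choose(tab, 2))          # 6

# Part 2: choose(a_i, 2) summed over the row sums
sum_a <- sum(choose(rowSums(tab), 2))   # 18

# Part 3: choose(b_j, 2) summed over the column sums
sum_b <- sum(choose(colSums(tab), 2))   # 18
```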

Substituting the values into the ARI formula we get:

ARI = \frac{6 - [18 \times 18] / { {12}\choose{2} } } {\frac{1}{2} [18+18] - [18 \times 18] / { {12}\choose{2} }} = \frac{6 - 4.909091}{18 - 4.909091} = \frac{1.090909}{13.090909} = 0.08333333
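
Putting the pieces together, here is a small base R function that follows the formula directly. It is a sketch rather than a reference implementation: ari() is a hypothetical name of mine, and it returns NaN when the denominator is zero (for example, when both clusterings put everything into a single cluster).

```r
# Hypothetical helper: ARI computed straight from the contingency table
ari <- function(x, y) {
  tab <- table(x, y)
  sum_nij <- sum(choose(tab, 2))           # pairs together in both clusterings
  sum_a <- sum(choose(rowSums(tab), 2))    # pairs together in x
  sum_b <- sum(choose(colSums(tab), 2))    # pairs together in y
  expected <- sum_a * sum_b / choose(sum(tab), 2)  # ExpectedIndex
  max_index <- (sum_a + sum_b) / 2                 # MaxIndex
  (sum_nij - expected) / (max_index - expected)
}

# Cluster labels copied from the printed output above
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

ari(x, y)  # 0.08333333
```

As a sanity check, comparing a clustering with itself, or with a relabelled copy of itself such as c(2, 3, 1)[x], gives an ARI of 1.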

Using R

The clues package contains the adjustedRand() function that can calculate the Rand index and the ARI.

# install if you haven't already
install.packages("clues")
# load package
library(clues)

set.seed(1)
x <- sample(x = rep(1:3, 4), 12)
set.seed(2)
y <- sample(x = rep(1:3, 4), 12)

adjustedRand(x, y)
      Rand         HA         MA         FM    Jaccard 
0.63636364 0.08333333 0.25000000 0.33333333 0.20000000

The adjustedRand() function calculates:

the five agreement indices: Rand index, Hubert and Arabie's adjusted Rand index, Morey and Agresti's adjusted Rand index, Fowlkes and Mallows's index, and Jaccard index, which measure the agreement between any two partitions for a data set.

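The HA value (0.08333333) matches the ARI worked out by hand above, so I guess the formula shown on Wikipedia must be Hubert and Arabie's adjusted Rand index. Notice how different the Rand index (0.636) is from the ARI (0.083). This makes sense: the example data used in this post is small and the labels were assigned at random, so there would be a lot of overlaps just due to chance, which the Rand index counts as agreement and the ARI corrects for.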
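
To see where the unadjusted value comes from, the Rand index can be written in terms of the same three sums used for the ARI: the number of agreeing pairs is the total number of pairs plus twice the pairs together in both clusterings, minus the pairs together in each clustering separately. The sketch below uses that identity; rand_index() is a hypothetical name of mine, not a function from the clues package.

```r
# Hypothetical helper: Rand index from the contingency table
rand_index <- function(x, y) {
  tab <- table(x, y)
  n_pairs <- choose(sum(tab), 2)           # all pairs of elements
  sum_nij <- sum(choose(tab, 2))           # pairs together in both clusterings
  sum_a <- sum(choose(rowSums(tab), 2))    # pairs together in x
  sum_b <- sum(choose(colSums(tab), 2))    # pairs together in y
  # Agreements = pairs together in both + pairs apart in both
  agreements <- n_pairs + 2 * sum_nij - sum_a - sum_b
  agreements / n_pairs
}

# Cluster labels copied from the printed output above
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

rand_index(x, y)  # 0.6363636
```

Of the 66 pairs, 42 agree, but 36 of those are pairs that are apart in both clusterings, which is easy to achieve by chance when there are three equally sized clusters.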

This work is licensed under a Creative Commons Attribution 4.0 International License.