The Adjusted Rand index

In my last post, I wrote about the Rand index. This post will be on the Adjusted Rand index (ARI), which is the corrected-for-chance version of the Rand index:

AdjustedIndex = \frac{Index - ExpectedIndex}{MaxIndex - ExpectedIndex}

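Given the contingency table for two clusterings X (with r clusters) and Y (with s clusters), where n_{ij} is the number of elements that are in both cluster X_i and cluster Y_j: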

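\begin{array}{c|cccc|c}
 & Y_1 & Y_2 & \cdots & Y_s & \text{Sums} \\
\hline
X_1 & n_{11} & n_{12} & \cdots & n_{1s} & a_1 \\
X_2 & n_{21} & n_{22} & \cdots & n_{2s} & a_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
X_r & n_{r1} & n_{r2} & \cdots & n_{rs} & a_r \\
\hline
\text{Sums} & b_1 & b_2 & \cdots & b_s &
\end{array}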

the adjusted index is:

ARI = \frac{ \sum_{ij} { {n_{ij}}\choose{2} } - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } } { \frac{1}{2} [ \sum_{i} { a_{i}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ] - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } }
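
Reading this against the general form at the top of the post, the three ingredients line up as follows. This is only a restatement of the formula, with n taken to be the total number of elements, so that n = \sum_{i} a_i = \sum_{j} b_j:

```latex
Index = \sum_{ij} { {n_{ij}}\choose{2} }

ExpectedIndex = [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} }

MaxIndex = \frac{1}{2} [ \sum_{i} { {a_{i}}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ]
```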

As per usual, it'll be easier to understand with an example. I'll use R to create two random sets of elements, which represent clustering results.

set.seed(1)
x <- sample(x = rep(1:3, 4), 12)
x
 [1] 1 2 3 3 2 1 1 3 3 1 2 2

set.seed(2)
y <- sample(x = rep(1:3, 4), 12)
y
 [1] 3 2 3 2 2 1 1 2 3 1 3 1

In this example there are 3 clusters in both sets, so our contingency table will have three rows and three columns. We just need to count the co-occurrences to build the contingency table. n_{11} would be the number of times an element occurs in cluster 1 of X and cluster 1 of Y; this occurs three times: the sixth, seventh, and tenth elements. Here's the full contingency table:

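\begin{array}{c|ccc|c}
 & Y_1 & Y_2 & Y_3 & \text{Row sums} \\
\hline
X_1 & 3 & 0 & 1 & 4 \\
X_2 & 1 & 2 & 1 & 4 \\
X_3 & 0 & 2 & 2 & 4 \\
\hline
\text{Column sums} & 4 & 4 & 4 &
\end{array}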
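
The same table can be produced with base R's table() function. This is a minimal sketch; the two label vectors are typed in by hand from the printed output earlier in the post, so the result does not depend on how a particular R version implements sample().

```r
# cluster labels copied from the printed output earlier in the post
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

# cross-tabulate: rows are the clusters of x, columns are the clusters of y
tab <- table(x, y)

# show the table with row and column sums appended
addmargins(tab)

# positions of the elements that are in cluster 1 of both x and y
both_one <- which(x == 1 & y == 1)
both_one
```

The last line lists the sixth, seventh, and tenth elements, which is where n_{11} = 3 comes from.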

If you look closely at the ARI formula, there are really just three different parts:

  1. \sum_{ij} { {n_{ij}}\choose{2} }
  2. \sum_{i} { {a_{i}}\choose{2} }
  3. \sum_{j} { {b_{j}}\choose{2} }

\sum means the sum, i refers to the row number, j refers to the column number, a refers to the row sum, b refers to the column sum, and n is the total number of elements (12 in this example). Now let's work out each part.

  1. \sum_{ij} { {n_{ij}}\choose{2} } = { {3}\choose{2} } + { {0}\choose{2} } + { {1}\choose{2} } + { {1}\choose{2} } + { {2}\choose{2} } + { {1}\choose{2} } + { {0}\choose{2} } + { {2}\choose{2} } + { {2}\choose{2} } = 3 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 6
  2. \sum_{i} { {a_{i}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18
  3. \sum_{j} { {b_{j}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18
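
The three sums can be checked in R with choose(), which is vectorised and returns 0 for choose(0, 2) and choose(1, 2). This is a small sketch with the label vectors hard-coded from the earlier output; the names part1, part2, and part3 are only for illustration.

```r
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)
tab <- table(x, y)

part1 <- sum(choose(tab, 2))           # sum over all cells n_ij
part2 <- sum(choose(rowSums(tab), 2))  # sum over row sums a_i
part3 <- sum(choose(colSums(tab), 2))  # sum over column sums b_j

c(part1, part2, part3)
# 6 18 18
```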

Substituting the values into the ARI formula we get:

ARI = \frac{6 - [18 \times 18] / { {12}\choose{2} } } {\frac{1}{2} [18+18] - [18 \times 18] / { {12}\choose{2} }} = \frac{6 - 4.909091}{18 - 4.909091} = 0.08333333
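
Putting the pieces together, the whole calculation fits in a short base R function. This is a sketch of the formula used in this post, not the code of any package; adjusted_rand() is a made-up name, and it returns NaN when the denominator is zero (for example, when both inputs put every element in a single cluster).

```r
# ARI from two vectors of cluster labels of equal length
adjusted_rand <- function(x, y) {
  tab <- table(x, y)
  index <- sum(choose(tab, 2))
  sum_a <- sum(choose(rowSums(tab), 2))
  sum_b <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(length(x), 2)
  maximum <- (sum_a + sum_b) / 2
  (index - expected) / (maximum - expected)
}

x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

ari <- adjusted_rand(x, y)
ari
# 0.08333333

# renaming the clusters of y (1 -> 2, 2 -> 3, 3 -> 1) only reorders
# the columns of the table, so the ARI is unchanged
ari_relabelled <- adjusted_rand(x, c(2, 3, 1)[y])
```

The relabelling check shows that the cluster numbers themselves carry no meaning: cluster 1 of X does not have to correspond to cluster 1 of Y for the index to work.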

Using R

The clues package contains the adjustedRand() function that can calculate the Rand index and the ARI.

# install if you haven't already
install.packages("clues")
# load package
library(clues)

set.seed(1)
x <- sample(x = rep(1:3, 4), 12)
set.seed(2)
y <- sample(x = rep(1:3, 4), 12)

adjustedRand(x, y)
      Rand         HA         MA         FM    Jaccard 
0.63636364 0.08333333 0.25000000 0.33333333 0.20000000

The adjustedRand() function calculates:

the five agreement indices: Rand index, Hubert and Arabie's adjusted Rand index, Morey and Agresti's adjusted Rand index, Fowlkes and Mallows's index, and Jaccard index, which measure the agreement between any two partitions for a data set.

The HA value (0.08333333) matches the manual calculation above, so the formula shown on Wikipedia must be Hubert and Arabie's adjusted Rand index. Notice how different the Rand index (0.636) is from the ARI (0.083); this makes sense because the example data is small and random, so much of the agreement counted by the Rand index arises just by chance.
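
To get a feel for how much agreement chance alone produces, one option is to shuffle y many times and recompute both indices. The sketch below uses base R only, so it does not need the clues package. The helper indices() is a made-up name, and the Rand index is computed from the same three sums as the ARI, using the identity that the number of agreeing pairs is {n \choose 2} + 2 \sum_{ij} {n_{ij} \choose 2} - \sum_i {a_i \choose 2} - \sum_j {b_j \choose 2}.

```r
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)
n_pairs <- choose(length(x), 2)

# Rand index and ARI from the same three sums
indices <- function(x, y) {
  tab <- table(x, y)
  s_n <- sum(choose(tab, 2))
  s_a <- sum(choose(rowSums(tab), 2))
  s_b <- sum(choose(colSums(tab), 2))
  expected <- s_a * s_b / n_pairs
  c(rand = (n_pairs + 2 * s_n - s_a - s_b) / n_pairs,
    ari  = (s_n - expected) / ((s_a + s_b) / 2 - expected))
}

observed <- indices(x, y)
observed

# shuffle y 2000 times; each shuffle keeps the cluster sizes fixed
set.seed(42)
sims <- replicate(2000, indices(x, sample(y)))
chance <- rowMeans(sims)
chance
```

With cluster sizes of 4, 4, and 4, the average Rand index over the shuffles should come out at roughly 0.6 while the average ARI should sit close to 0. That gap is the chance agreement the adjustment removes.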




This work is licensed under a Creative Commons Attribution 4.0 International License.
11 comments
  1. Hi Dave

    First of all, nice post, and nice post on the difference between the Rand index and the adjusted Rand index.

    I have been staring at the contingency table, trying to make sense of this sentence:

    Element occurs in cluster 1 of X and cluster 1 of Y; this occurs three times: the sixth, seventh, and tenth elements

    Can you elaborate on what you mean by the sixth, seventh, and tenth elements, and on what you define as an element?
    Best,

    Jon

    1. Sorry, let me try to explain it again.

      We have a list of 12 items and I refer to each as an element. x and y contain the clustering results of these 12 items. In x, four elements belong to cluster one and in y, four elements belong to cluster one.

      If you examine the clustering results, you’ll see that the sixth, seventh, and tenth elements are in cluster one for both clustering results.

  2. Thank you, Thank you, Thank you.
    I have been getting bogged down trying to get my head around this (like so many, my mind seems to slow down when hit with densely packed mathematical symbols that need translation), and I just needed to see it worked through. This blog did just that beautifully. Well done!

  3. Yello Dave!

    Thank you for an interesting post, very educational and easy to follow. I’m having a problem with the following statement:

    “n_11 would be the number of times an element occurs in cluster 1 of X and cluster 1 of Y”

    How do we know that cluster 1 of X corresponds to cluster 1 of Y? It is easy to think of a setting in which an original cluster A is split into two clusters, B and C. How does that translate into this setting?

    In the case of the Rand Index we looked at two internal points a & b, and how they were clustered by two clustering methods A and B. Are we not doing the same thing here?

    Best regards,
    Mamod

  4. Hi! ARI has a value range of [-1,1], where 1 stands for complete agreement between partitions and 0 means both partitions are random. But what does a negative value stand for? I mean, what does a value of -1 or -0.33 mean?
    Thanks in advance. David

  5. Hi Sir, I am a researcher from India. In my paper, I need to express an equation for ARI in terms of FP, TP, TN, and FN. We have the expression for the Rand Index: RI = (TP+TN)/(TP+TN+FP+FN). How can we write ARI, as well as Purity, in these terms?

  6. I didn’t see a package called “clues”, but I did see a “clue”. I’m going to install that one and see how it works out.
    going to get ‘clue’ now!

    1. The “clues” package is not available for current versions of R. However, it is still available in CRAN’s GitHub repository. To install it from there, you can run:

      install.packages("devtools")
      devtools::install_github("cran/clues")
      library(clues)

      Thank you. Hope it helps.

      ~ Jyotishka

  7. If X is a cluster analysis resulting in (say) a map showing some clusters, and Y is simply some regions on the (same) map, can we view these regions as “clusters” and do this test, or must it be two sets of clusters?
