Hierarchical clustering

Hierarchical clustering is a cluster analysis method that builds a hierarchy of clusters. There are two types of strategies: agglomerative (bottom-up), which starts with each point in its own cluster and repeatedly merges the closest clusters, and divisive (top-down), which starts with everything in one cluster and recursively splits it. This post is about agglomerative hierarchical clustering.

head(USArrests)
#           Murder Assault UrbanPop Rape
#Alabama      13.2     236       58 21.2
#Alaska       10.0     263       48 44.5
#Arizona       8.1     294       80 31.0
#Arkansas      8.8     190       50 19.5
#California    9.0     276       91 40.6
#Colorado      7.9     204       78 38.7
#scatterplot matrix of all four variables
plot(USArrests)

The scatterplot matrix shows some correlation between assault and murder. To quantify the correlation:

#columns: 1 = Murder, 2 = Assault, 3 = UrbanPop, 4 = Rape
cor(x=USArrests[,1],y=USArrests[,2],method="pearson")
#[1] 0.8018733
cor(x=USArrests[,1],y=USArrests[,2],method="spearman")
#[1] 0.8172735
cor(x=USArrests[,3],y=USArrests[,4],method="spearman")
#[1] 0.4381068
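#all pairwise correlations at once
cor(USArrests)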
#calculate the Euclidean distance matrix and build a dendrogram
#using hierarchical clustering with complete/maximum linkage
d <- dist(USArrests,method="euclidean")
h = hclust(d,method="complete")
plot(h)
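#the dendrogram can be cut to obtain a flat clustering,
#e.g. into four groups (the choice of four is arbitrary, purely for illustration)
groups <- cutree(h, k=4)
table(groups)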

Now, to see what's going on, let's build a small two dimensional toy example:

#note: no seed is set, so sample() returns different values on each run
x <- sample(1:10000,7)
x
#[1] 9923 1678 9371 8602 9788 4008 4598
y <- sample(1:10000,7)
y
#[1] 2466 6768 1442 8307  211 2722 9948
test <- data.frame(x,y)
#the default distance metric is Euclidean
d <- dist(test)
d
#          1         2         3         4         5         6
#2  9299.851
#3  1163.306  9356.737
#4  5988.516  7092.975  6907.936
#5  2259.037 10429.111  1299.712  8182.409
#6  5920.537  4668.942  5513.635  7231.671  6301.866
#7  9183.461  4317.268  9753.644  4327.227 11033.824  7250.047
library("rggobi")
g <- ggobi(test)

#manually calculate the Euclidean distance from point 2 to point 7 using Pythagoras' theorem
sqrt((4598 - 1678)^2 + (9948 - 6768)^2)
#[1] 4317.268
#if we examine the distance matrix, we see that entry (7,2) is indeed 4317.268
d
#          1         2         3         4         5         6
#2  9299.851                                                  
#3  1163.306  9356.737                                        
#4  5988.516  7092.975  6907.936                              
#5  2259.037 10429.111  1299.712  8182.409                    
#6  5920.537  4668.942  5513.635  7231.671  6301.866          
#7  9183.461  4317.268  9753.644  4327.227 11033.824  7250.047
#By looking at the distance matrix we can see that 1 and 3 are the closest together
#i.e. smallest distance apart
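#one way to confirm this programmatically: locate the minimum entry
m <- as.matrix(d)
diag(m) <- NA #ignore the zero self-distances on the diagonal
which(m == min(m, na.rm=TRUE), arr.ind=TRUE) #returns the (3,1) and (1,3) positions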
#now for hierarchical clustering
#plot dendrograms for three different linkage methods side by side
op <- par(mfrow=c(1,3))
plot(hclust(d,method="single"))
plot(hclust(d,method="average"))
plot(hclust(d,method="complete"))
par(op) #restore the original plotting parameters

So hierarchical clustering first takes the two closest points, which in this case are 1 and 3. Next, it finds the point closest to the {1, 3} cluster, which is 5. However, depending on the agglomeration (linkage) method, this distance is calculated in a different manner.

In single linkage clustering, once 1 and 3 are clustered together, the distance to 5 is the smaller of the distances from 1 to 5 and from 3 to 5. In this case, 1 -> 5 is 2259.037 and 3 -> 5 is 1299.712; since 5 is closer to 3, the distance from 5 to the {1, 3} cluster is 1299.712.

Complete linkage is the opposite of single linkage in that the furthest distance is used, which in this case is 2259.037, i.e. the distance 1 -> 5. Average linkage takes the mean of 1 -> 5 (2259.037) and 3 -> 5 (1299.712), which is 1779.374.
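
To make the three rules concrete, here is a quick sanity check using the two distances read off the matrix above:

#distances from points 1 and 3 to point 5
d15 <- 2259.037
d35 <- 1299.712
min(d15, d35)     #single linkage: 1299.712
max(d15, d35)     #complete linkage: 2259.037
mean(c(d15, d35)) #average linkage: 1779.374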

Once 1, 3 and 5 are clustered together, the next closest cluster is sought out and merged, and this process continues (which is why this is called hierarchical clustering) until all points belong to a single cluster.
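
The merge order is recorded in the hclust object: $merge shows which observations (negative entries) or previously formed clusters (positive row numbers) are joined at each step, and $height gives the distance at which each merge happens. For example, with single linkage:

hs <- hclust(d, method="single")
hs$merge  #the first row joins points 1 and 3; the second joins that cluster with point 5
hs$height #the first two heights are 1163.306 and 1299.712, matching the distances above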

Euclidean distance of two vectors

How about calculating the Euclidean distance between two vectors? The same idea extends to three (or more) dimensions:

x <- sample(1:10000,7)
y <- sample(1:10000,7)
z <- sample(1:10000,7)
test <- data.frame(x,y,z)
test
#     x    y    z
#1 8629 3148 1810
#2 1449 7282 4373
#3 5522 3897  654
#4 4533 5946  636
#5 5089 4347 9358
#6 4415  867 8849
#7 8957 5031 2201
dist(test)
#         1        2        3        4        5        6
#2 8672.446
#3 3398.645 6471.361
#4 5097.479 5026.044 2275.268
#5 8422.678 6834.768 8726.374 8884.775
#6 8515.179 8365.641 8807.064 9657.305 3581.027
#7 1950.937 8133.551 3934.259 4781.028 8164.063 9064.467
#define function for calculating Euclidean distance
euclidean_distance <- function(x1,x2){
   sqrt(sum((x1 - x2) ^ 2))
}
#calculate distance from 1 to 2
x1 <- c(8629,3148,1810)
x2 <- c(1449,7282,4373)
euclidean_distance(x1,x2)
#[1] 8672.446
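
The same value can be read directly out of the distance matrix by converting the dist object to a matrix:

as.matrix(dist(test))[1,2]
#[1] 8672.446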

Parallelisation

The amap package parallelises the distance calculation step and the hierarchical clustering. This is extremely useful when you have a large dataset.

#install
install.packages("amap")
#load
library("amap")
#adjust nbproc to the number of available processors
#Dist() is a parallel drop-in replacement for dist()
d <- Dist(big_df, method="euclidean", nbproc=30)
hc <- hclust(d, method="complete")
#alternatively, hcluster() computes the distances and the tree in one call from the raw data
hc <- hcluster(big_df, method="euclidean", link="complete", nbproc=30)
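
The result of hcluster() is a standard hclust object, so the usual functions such as plot() and cutree() work on it:

plot(hc)
#cut the tree into k flat clusters; k=10 is arbitrary here
clusters <- cutree(hc, k=10)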