Hierarchical clustering is a cluster analysis method that builds a hierarchy of clusters. There are two types of strategies: agglomerative (bottom up) and divisive (top down). This post is about agglomerative hierarchical clustering.

```r
head(USArrests)
#            Murder Assault UrbanPop Rape
# Alabama      13.2     236       58 21.2
# Alaska       10.0     263       48 44.5
# Arizona       8.1     294       80 31.0
# Arkansas      8.8     190       50 19.5
# California    9.0     276       91 40.6
# Colorado      7.9     204       78 38.7

plot(USArrests)
```

You can see some correlation between Assault and Murder. To test the correlation:

```r
cor(x=USArrests[,1], y=USArrests[,2], method="pearson")
# [1] 0.8018733
cor(x=USArrests[,1], y=USArrests[,2], method="spearman")
# [1] 0.8172735
cor(x=USArrests[,3], y=USArrests[,4], method="spearman")
# [1] 0.4381068

# calculate the euclidean distance and make a dendrogram using
# hierarchical clustering with complete/maximum linkage
d <- dist(USArrests, method="euclidean")
h <- hclust(d, method="complete")
plot(h)
```

Now to see what's going on:

```r
x <- sample(1:10000, 7)
x
# [1] 9923 1678 9371 8602 9788 4008 4598
y <- sample(1:10000, 7)
y
# [1] 2466 6768 1442 8307  211 2722 9948
test <- data.frame(x, y)

# the default distance method is euclidean
d <- dist(test)
d
#          1         2        3        4         5        6
# 2  9299.851
# 3  1163.306  9356.737
# 4  5988.516  7092.975 6907.936
# 5  2259.037 10429.111 1299.712 8182.409
# 6  5920.537  4668.942 5513.635 7231.671  6301.866
# 7  9183.461  4317.268 9753.644 4327.227 11033.824 7250.047

# visualise the points interactively with GGobi
library("rggobi")
g <- ggobi(test)
```

```r
# manually calculate the euclidean distance from 2 to 7 using Pythagoras
sqrt((4598 - 1678)^2 + (9948 - 6768)^2)
# [1] 4317.268

# if we examine the distance matrix, we see that entry (7, 2) is 4317.268
d
#          1         2        3        4         5        6
# 2  9299.851
# 3  1163.306  9356.737
# 4  5988.516  7092.975 6907.936
# 5  2259.037 10429.111 1299.712 8182.409
# 6  5920.537  4668.942 5513.635 7231.671  6301.866
# 7  9183.461  4317.268 9753.644 4327.227 11033.824 7250.047

# the distance matrix also shows that 1 and 3 are the closest together,
# i.e. the smallest distance apart

# now for hierarchical clustering with three different linkage methods
op <- par(mfrow=c(1, 3))
plot(hclust(d, method="single"))
plot(hclust(d, method="average"))
plot(hclust(d, method="complete"))
par(op)
```

So hierarchical clustering first merges the two closest points, which in this case are 1 and 3. Next it finds the point closest to the {1, 3} cluster, which is 5. However, depending on the agglomeration (linkage) method, the distances are calculated in a different manner.

In single linkage clustering, once 1 and 3 are clustered together, the distance to 5 is whichever of the distances 1 -> 5 and 3 -> 5 is smaller. In this case, 1 -> 5 is 2259.037 and 3 -> 5 is 1299.712; since 5 is closer to 3, the distance of 5 to the {1, 3} cluster is therefore 1299.712.

Complete linkage is the opposite of single linkage in that the furthest distance is used, which in this case is 2259.037, i.e. the distance 1 -> 5. Average linkage uses the average of 1 -> 5 (2259.037) and 3 -> 5 (1299.712), which is 1779.374.

Once 1, 3 and 5 are clustered together, the next closest point is sought out and this process continues (which is why this is called hierarchical clustering) until every point is clustered.
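The three linkage rules above can be checked directly against the distance matrix. Here the sampled points are hard-coded so the numbers match the output shown earlier:

```r
# hard-coded sample points from above, so the result is reproducible
x <- c(9923, 1678, 9371, 8602, 9788, 4008, 4598)
y <- c(2466, 6768, 1442, 8307, 211, 2722, 9948)
m <- as.matrix(dist(data.frame(x, y)))

# 1 and 3 merge first; the distance from point 5 to the {1, 3} cluster is:
min(m[1, 5], m[3, 5])      # single linkage:   1299.712
max(m[1, 5], m[3, 5])      # complete linkage: 2259.037
mean(c(m[1, 5], m[3, 5]))  # average linkage:  1779.374

# the second merge height under single linkage is exactly that distance
hclust(dist(data.frame(x, y)), method="single")$height[2]
# [1] 1299.712
```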

### Euclidean distance of two vectors

How about calculating the Euclidean distance between two vectors?

```r
x <- sample(1:10000, 7)
y <- sample(1:10000, 7)
z <- sample(1:10000, 7)
test <- data.frame(x, y, z)
test
#      x    y    z
# 1 8629 3148 1810
# 2 1449 7282 4373
# 3 5522 3897  654
# 4 4533 5946  636
# 5 5089 4347 9358
# 6 4415  867 8849
# 7 8957 5031 2201

dist(test)
#          1        2        3        4        5        6
# 2 8672.446
# 3 3398.645 6471.361
# 4 5097.479 5026.044 2275.268
# 5 8422.678 6834.768 8726.374 8884.775
# 6 8515.179 8365.641 8807.064 9657.305 3581.027
# 7 1950.937 8133.551 3934.259 4781.028 8164.063 9064.467

# define a function for calculating the Euclidean distance
euclidean_distance <- function(x1, x2){
  sqrt(sum((x1 - x2)^2))
}

# calculate the distance from 1 to 2
x1 <- c(8629, 3148, 1810)
x2 <- c(1449, 7282, 4373)
euclidean_distance(x1, x2)
# [1] 8672.446
```
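As a quick sanity check, the manual function reproduces every entry returned by `dist()`. The data frame is hard-coded with the values from above so the comparison is reproducible:

```r
# hard-coded data from the example above
test <- data.frame(x = c(8629, 1449, 5522, 4533, 5089, 4415, 8957),
                   y = c(3148, 7282, 3897, 5946, 4347,  867, 5031),
                   z = c(1810, 4373,  654,  636, 9358, 8849, 2201))

euclidean_distance <- function(x1, x2) sqrt(sum((x1 - x2)^2))

# compare each pairwise distance with the built-in dist()
m <- as.matrix(dist(test))
for (i in 1:6) {
  for (j in (i + 1):7) {
    manual <- euclidean_distance(unlist(test[i, ]), unlist(test[j, ]))
    stopifnot(abs(manual - m[i, j]) < 1e-6)
  }
}
```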

### Parallelisation

The amap package parallelises both the distance calculation step and the hierarchical clustering, which is extremely useful when you have a large dataset.

```r
# install
install.packages("amap")

# load
library("amap")

# adjust nbproc to the number of available processors;
# Dist() parallelises the distance calculation
d <- Dist(big_df, method="euclidean", nbproc=30)
hc <- hclust(d, method="complete")

# alternatively, hcluster() takes the raw data and parallelises both
# the distance calculation and the clustering
hc <- hcluster(big_df, method="euclidean", nbproc=30)
```