Hierarchical clustering

Hierarchical clustering is a cluster analysis method that builds a hierarchy of clusters. There are two strategies: agglomerative (bottom up) and divisive (top down). This post is about agglomerative hierarchical clustering using the USArrests data set.

dim(USArrests)
[1] 50  4

head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

plot(USArrests)

You can see a positive correlation between Assault and Murder.

For a prettier plot that also shows the Pearson correlations, use the GGally package.

install.packages("GGally")
library(GGally)
ggpairs(USArrests)

Calculate the correlations.

cor(USArrests$Murder, USArrests$Assault)
[1] 0.8018733

cor(USArrests$Murder, USArrests$Assault, method = "spearman")
[1] 0.8172735

cor(USArrests$UrbanPop, USArrests$Rape)
[1] 0.4113412
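
The full matrix of pairwise Pearson correlations can also be computed in one call.

# all pairwise Pearson correlations
cor(USArrests)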

Perform hierarchical clustering using the Euclidean distance and complete/maximum linkage.

d <- dist(USArrests, method="euclidean")
h <- hclust(d, method="complete")
plot(h)
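
As an illustrative follow-up (the choice of four groups here is arbitrary), the dendrogram can be cut into a fixed number of clusters with cutree() and the groups highlighted on the plot with rect.hclust().

# cut the dendrogram into four clusters
groups <- cutree(h, k=4)
# number of states in each cluster
table(groups)
# highlight the four clusters on the dendrogram
rect.hclust(h, k=4)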

Let's break down the entire process, starting with how the Euclidean distance is calculated on some randomly generated points.

set.seed(31)
x <- sample(1:10000, 7)
x
[1] 5222 9499 4269 4244 9444 4520 8322

set.seed(32)
y <- sample(1:10000, 7)
y
[1] 5059 5948 8086 7287 1520 9558 7531

test <- data.frame(x, y)

# calculate Euclidean distance
dist(test)
         1        2        3        4        5        6
2 4368.415                                             
3 3173.474 5650.128                                    
4 2433.201 5422.909  799.391                           
5 5509.066 4428.342 8360.202 7765.197                  
6 4553.439 6150.003 1493.246 2287.710 9426.305         
7 3964.944 1972.617 4090.823 4085.293 6114.818 4308.588

plot(test, type = "n")
text(test)

We can see that some points are closer to each other than others; for example, points 3 and 4 are very close together.

The distance formula in Cartesian coordinates is derived from the Pythagorean theorem: for two points (x1, y1) and (x2, y2), the distance is sqrt((x1 - x2)^2 + (y1 - y2)^2). Let's work out the distance from point 2 to 7.

test
     x    y
1 5222 5059
2 9499 5948
3 4269 8086
4 4244 7287
5 9444 1520
6 4520 9558
7 8322 7531

sqrt((9499 - 8322)^2 + (5948 - 7531)^2)
[1] 1972.617

# the above result is the same as the result from dist()
dist(test)
         1        2        3        4        5        6
2 4368.415                                             
3 3173.474 5650.128                                    
4 2433.201 5422.909  799.391                           
5 5509.066 4428.342 8360.202 7765.197                  
6 4553.439 6150.003 1493.246 2287.710 9426.305         
7 3964.944 1972.617 4090.823 4085.293 6114.818 4308.588

# from the plot above we can see that 3 and 4 are the closest
# this is also the case in the distance matrix above
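
As a quick check, we can also find the closest pair programmatically from the distance matrix.

# convert the dist object to a matrix and ignore the diagonal
m <- as.matrix(dist(test))
diag(m) <- NA
# indices of the smallest distance; the pair appears twice because the matrix is symmetric
which(m == min(m, na.rm=TRUE), arr.ind=TRUE)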

Now let's compare the different agglomeration methods.

op <- par(mfrow=c(1,3))
plot(hclust(dist(test), method="single"))
plot(hclust(dist(test), method="average"))
plot(hclust(dist(test), method="complete"))
# restore the original plotting parameters
par(op)

Hierarchical clustering first merges the two closest points, which in this case are 3 and 4, and then finds the point closest to the {3, 4} cluster, which is 6. However, depending on the agglomeration method, the cluster-to-cluster distances are calculated in a different manner.

In single linkage clustering, once 3 and 4 are clustered together, the distance to 6 is the smaller of the distances from 3 to 6 and from 4 to 6. 3 -> 6 is 1493.246 and 4 -> 6 is 2287.710, so the single linkage distance from 6 to the {3, 4} cluster is 1493.246. Complete linkage is the opposite of single linkage in that the furthest distance is used, so the distance from 6 to {3, 4} is 2287.710. Average linkage uses the average of 3 -> 6 (1493.246) and 4 -> 6 (2287.710), which is 1890.478. (Note that under complete linkage, points 2 and 7, at a distance of 1972.617, are merged before 6 joins {3, 4}, because the complete linkage distance from 6 to {3, 4} is the larger 2287.710.)

Once 3, 4, and 6 are clustered together, the process continues, merging the next closest pair of clusters at each step, until every point belongs to a single cluster.
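
To see how this plays out numerically, we can inspect the merge heights that hclust() stores for each linkage method; the height component gives the inter-cluster distance at each successive merge.

hclust(dist(test), method="single")$height
hclust(dist(test), method="average")$height
hclust(dist(test), method="complete")$height

The first height, 799.391 (merging 3 and 4), is the same for all three methods; the later heights differ according to the linkage rule.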

Euclidean distance of two vectors

How about calculating the Euclidean distance between two vectors, i.e. points with more than two dimensions?

set.seed(31)
x <- sample(1:10000, 7)

set.seed(32)
y <- sample(1:10000, 7)

set.seed(33)
z <- sample(1:10000, 7)
test <- data.frame(x,y,z)
test
     x    y    z
1 5222 5059 4460
2 9499 5948 3947
3 4269 8086 4837
4 4244 7287 9187
5 9444 1520 8436
6 4520 9558 5171
7 8322 7531 4369

dist(test)
         1        2        3        4        5        6
2 4398.434                                             
3 3195.789 5719.794                                    
4 5316.484 7540.925 4422.841                           
5 6793.996 6305.659 9101.966 7801.429                  
6 4608.614 6270.623 1530.144 4621.891 9975.743         
7 3965.989 2017.251 4117.506 6316.862 7343.807 4382.595

# define function for calculating Euclidean distance
euclidean_distance <- function(x1,x2){
   sqrt(sum((x1 - x2) ^ 2))
}

# calculate distance from 1 to 2
euclidean_distance(test[1,], test[2,])
[1] 4398.434
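
As a sanity check, we can confirm that the function reproduces the corresponding entry of dist().

# compare the custom function with the matching entry of the dist() output
all.equal(euclidean_distance(test[1, ], test[2, ]), as.matrix(dist(test))[1, 2])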

Parallelisation

The amap package parallelises both the distance calculation step and the hierarchical clustering, which is extremely useful when you have a large dataset.

# install
install.packages("amap")
# load
library("amap")
# adjust nbproc to the number of available processors
# Dist() is a parallel version of dist()
d <- Dist(big_df, method="euclidean", nbproc=30)
h <- hclust(d, method="complete")
# alternatively, hcluster() runs the distance calculation and the
# clustering in parallel directly on the data
h <- hcluster(big_df, method="euclidean", link="complete", nbproc=30)