# Hierarchical clustering

Hierarchical clustering is a cluster analysis method that builds a hierarchy of clusters. There are two types of strategies: agglomerative (bottom-up) and divisive (top-down). This post is about agglomerative hierarchical clustering, using the USArrests data set.

```
dim(USArrests)
 50  4

# a few example rows of the data
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6

plot(USArrests)
```

You can see some correlation between assault and murder in the scatter plot matrix.

GGally can draw a prettier plot that also shows the Pearson correlations.

```
install.packages("GGally")
library(GGally)
ggpairs(USArrests)
```

Calculate the correlations.

```
cor(USArrests$Murder, USArrests$Assault)
 0.8018733

cor(USArrests$Murder, USArrests$Assault, method = "spearman")
 0.8172735

cor(USArrests$UrbanPop, USArrests$Rape)
 0.4113412
```
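
You can also pass the whole data frame to `cor()` to get all pairwise correlations at once; a minimal example:

```
# Pearson correlation matrix for all four variables
round(cor(USArrests), 2)
```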

Perform hierarchical clustering using the Euclidean distance and complete/maximum linkage.

```
d <- dist(USArrests, method = "euclidean")
h <- hclust(d, method = "complete")
plot(h)
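
# optionally, cut the dendrogram into groups; k = 4 here is an arbitrary choice
groups <- cutree(h, k = 4)
table(groups)          # cluster sizes
rect.hclust(h, k = 4)  # draw rectangles around the four clusters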
```

Let’s break down the entire process; this is how the Euclidean distance is calculated.

```
set.seed(31)
x <- sample(1:10000, 7)
x
 5222 9499 4269 4244 9444 4520 8322

set.seed(32)
y <- sample(1:10000, 7)
y
 5059 5948 8086 7287 1520 9558 7531

test <- data.frame(x, y)

# calculate Euclidean distance
dist(test)
         1        2        3        4        5        6
2 4368.415
3 3173.474 5650.128
4 2433.201 5422.909  799.391
5 5509.066 4428.342 8360.202 7765.197
6 4553.439 6150.003 1493.246 2287.710 9426.305
7 3964.944 1972.617 4090.823 4085.293 6114.818 4308.588

plot(test, type = "n")
text(test)
```

We can see from the plot that some points lie closer to each other than others.

The distance formula in Cartesian coordinates is derived from the Pythagorean theorem: for two points (x1, y1) and (x2, y2), the distance is sqrt((x1 - x2)^2 + (y1 - y2)^2). Let’s work out the distance from point 2 to point 7.

```
test
     x    y
1 5222 5059
2 9499 5948
3 4269 8086
4 4244 7287
5 9444 1520
6 4520 9558
7 8322 7531

sqrt((9499 - 8322)^2 + (5948 - 7531)^2)
 1972.617

# the above result is the same as the result from dist()
dist(test)
         1        2        3        4        5        6
2 4368.415
3 3173.474 5650.128
4 2433.201 5422.909  799.391
5 5509.066 4428.342 8360.202 7765.197
6 4553.439 6150.003 1493.246 2287.710 9426.305
7 3964.944 1972.617 4090.823 4085.293 6114.818 4308.588

# from the plot above we can see that 3 and 4 are the closest
# this is also the case in the distance matrix above
```

Now let’s compare the different agglomeration methods.

```
op <- par(mfrow = c(1, 3))
plot(hclust(dist(test), method = "single"))
plot(hclust(dist(test), method = "average"))
plot(hclust(dist(test), method = "complete"))
par(op)  # restore the previous plotting settings
```

Hierarchical clustering first merges the two closest points, which in this case are 3 and 4, and then finds the point closest to that new cluster, which is 6. However, depending on the agglomeration method, the distance between a point and a cluster is calculated in a different manner.

In single linkage clustering, once 3 and 4 are clustered together, the distance from 6 to the cluster is whichever of the distances 3 -> 6 and 4 -> 6 is smaller. 3 -> 6 is 1493.246 and 4 -> 6 is 2287.710, so the distance from 6 to {3, 4} is 1493.246. Complete linkage is the opposite of single linkage in that the furthest distance is used, so the distance from 6 to {3, 4} is 2287.710. Average linkage takes the average of 3 -> 6 (1493.246) and 4 -> 6 (2287.710), which is 1890.478.

Once 3, 4, and 6 are clustered together, the process continues with the next closest distance until every point has been clustered.
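
As a quick sanity check, we can pull these three linkage distances straight out of the distance matrix (a minimal sketch using the two-column test data frame from above):

```
# convert the dist object to a matrix so it can be indexed by row and column
m <- as.matrix(dist(test))

min(m[6, c(3, 4)])   # single linkage:   1493.246
max(m[6, c(3, 4)])   # complete linkage: 2287.710
mean(m[6, c(3, 4)])  # average linkage:  1890.478
```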

### Euclidean distance of two vectors

How about calculating the Euclidean distance between two vectors, i.e. points with more than two dimensions?

```
set.seed(31)
x <- sample(1:10000, 7)

set.seed(32)
y <- sample(1:10000, 7)

set.seed(33)
z <- sample(1:10000, 7)
test <- data.frame(x,y,z)
test
     x    y    z
1 5222 5059 4460
2 9499 5948 3947
3 4269 8086 4837
4 4244 7287 9187
5 9444 1520 8436
6 4520 9558 5171
7 8322 7531 4369

dist(test)
         1        2        3        4        5        6
2 4398.434
3 3195.789 5719.794
4 5316.484 7540.925 4422.841
5 6793.996 6305.659 9101.966 7801.429
6 4608.614 6270.623 1530.144 4621.891 9975.743
7 3965.989 2017.251 4117.506 6316.862 7343.807 4382.595

# define function for calculating Euclidean distance
euclidean_distance <- function(x1, x2){
  sqrt(sum((x1 - x2)^2))
}

# calculate distance from 1 to 2
euclidean_distance(test[1,], test[2,])
 4398.434
```
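
To check that the function reproduces `dist()` for every pair rather than just the first one, you can apply it over all row indices; a minimal sketch:

```
# build the full distance matrix with the custom function
n <- nrow(test)
d_manual <- sapply(1:n, function(i)
  sapply(1:n, function(j) euclidean_distance(test[i, ], test[j, ]))
)

# compare against the built-in dist(); should be TRUE
all.equal(as.vector(as.dist(d_manual)), as.vector(dist(test)))
```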

### Parallelisation

The amap package parallelises the distance calculation step and the hierarchical clustering, which is extremely useful when you have a large data set.

```
# install
install.packages("amap")
library("amap")

# big_df is a placeholder for a large data frame
# adjust nbproc to the number of available processors
d <- Dist(big_df, nbproc = 30)
hc <- hclust(d)

# or carry out both steps in one parallel call
hc <- hcluster(big_df, nbproc = 30)
``` 
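
If you are unsure how many processors are available, the parallel package that ships with R can tell you:

```
library(parallel)
detectCores()  # a sensible upper bound for nbproc
```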