# Clustering

Clustering is a form of exploratory data analysis (EDA) in which observations are divided into meaningful groups that share common characteristics (features).

Distance = 1 - similarity, where similarity lies between 0 and 1 (closer to 1 if more similar).

If the distance between two observations is small, they are more similar, and vice versa.

Measuring distance for categorical (binary) data - use the Jaccard distance, i.e. 1 - Jaccard index:

`dist(categorical_data, method = "binary")`
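
As a rough illustration of the call above, the sketch below builds a small made-up binary matrix (the object `x` and its row names are assumptions for the example, not part of the original notes) and checks by hand that `method = "binary"` returns 1 minus the Jaccard index.

```r
# Three observations described by four binary (presence/absence) features.
# The data are made up for illustration.
x <- rbind(
  a = c(1, 1, 0, 0),
  b = c(1, 0, 0, 0),
  c = c(0, 0, 1, 1)
)

# method = "binary" ignores features that are 0 in both rows and returns
# the share of the remaining features on which the two rows disagree,
# i.e. 1 - Jaccard index.
d <- dist(x, method = "binary")
d_mat <- as.matrix(d)

# Manual Jaccard index for rows a and b, to check the relationship.
both_on    <- sum(x["a", ] == 1 & x["b", ] == 1)
either_on  <- sum(x["a", ] == 1 | x["b", ] == 1)
jaccard_ab <- both_on / either_on
```

Here `d_mat["a", "b"]` should equal `1 - jaccard_ab`, and rows `a` and `c`, which share no features, sit at the maximum distance of 1.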

If a variable has more than two categories, dummify it first, i.e. convert it into one binary indicator column per category.

`library(dummies); dummy.data.frame(data)`
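
A minimal sketch of the same idea using only base R, for cases where an extra package is not wanted. The data frame and its `colour` column are hypothetical. Note that `model.matrix()` with `- 1` keeps every level only for the first factor in the formula, so this form is best suited to one categorical variable at a time.

```r
# Hypothetical data frame with one three-level categorical variable.
data <- data.frame(colour = factor(c("red", "green", "blue", "green")))

# "~ colour - 1" drops the intercept so every level gets its own 0/1 column.
dummies <- model.matrix(~ colour - 1, data = data)

# Jaccard distance on the indicator columns.
d <- dist(dummies, method = "binary")
d_mat <- as.matrix(d)
```

Observations with the same category end up at distance 0 and observations with different categories at distance 1.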

Cluster stability: https://www.coursera.org/lecture/cluster-analysis/6-9-cluster-stability-65y3a

Cluster stability overview: https://davetang.org/file/luxburg_ftml.pdf

Cluster Analysis in R: http://www.sthda.com/english/wiki/print.php?id=234

## Metrics

- Connectivity - for each observation, gather its L nearest neighbours; for the j-th nearest neighbour, add zero if it belongs to the same cluster as the current observation, otherwise add 1/j. The metric is between zero and infinity and should be minimised
- Silhouette Width - the average of each observation's Silhouette value. The Silhouette value measures the degree of confidence in the clustering assignment of a particular observation. Take the average distance a from an observation to all other observations in the same cluster and compare it to the average distance b from that observation to all observations in the nearest neighbouring cluster: the value is (b - a) / max(a, b). The metric is between -1 and 1 and should be maximised
- Dunn Index - ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. The metric is between zero and infinity and should be maximised
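
The three metrics above can be sketched as follows. This is an illustrative calculation on simulated data, not a reference implementation: the data, the choice of average-linkage hierarchical clustering, the helper name `connectivity`, and `L = 5` are all assumptions made for the example. `silhouette()` comes from the `cluster` package; the other two metrics are computed by hand from the distance matrix.

```r
library(cluster)  # for silhouette()

set.seed(1)
# Two made-up, well-separated groups of 10 points each in 2-D.
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
d  <- dist(x)
cl <- cutree(hclust(d, method = "average"), k = 2)
dm <- as.matrix(d)

# Silhouette width: mean of the per-observation silhouette values.
sil_width <- mean(silhouette(cl, d)[, "sil_width"])

# Dunn index: smallest between-cluster distance divided by the
# largest within-cluster distance.
same <- outer(cl, cl, "==")
dunn <- min(dm[!same]) / max(dm[same])

# Connectivity with L nearest neighbours: the j-th neighbour adds 1/j
# when it sits in a different cluster from the observation.
connectivity <- function(dm, cl, L = 5) {
  total <- 0
  for (i in seq_len(nrow(dm))) {
    nn <- order(dm[i, ])           # neighbours sorted by distance
    nn <- nn[nn != i][seq_len(L)]  # drop the observation itself
    total <- total + sum((cl[nn] != cl[i]) / seq_len(L))
  }
  total
}
conn <- connectivity(dm, cl)
```

For clusters this well separated, the connectivity should be zero, the Dunn index should exceed 1, and the silhouette width should be close to 1.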