K-means clustering

Updated: 2014 March 13th

From Wikipedia:

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
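
To make the definition concrete, here is a minimal sketch of the two steps that k-means alternates between: assign every observation to its nearest mean, then recompute each mean from the observations assigned to it. The toy points and the helper name `kmeans_step` are made up for illustration; this is not how R's `kmeans()` is implemented internally, and the sketch does not handle clusters that end up empty.

```r
# Toy data: two well separated groups of 2-D points (made up for illustration)
x <- rbind(c(1.0, 1.1), c(0.9, 1.0), c(1.2, 0.8),
           c(5.0, 5.2), c(5.1, 4.9), c(4.8, 5.0))

# Hypothetical helper: one assignment step followed by one update step
kmeans_step <- function(x, centers) {
  # squared Euclidean distance from every point (rows) to every centre (columns)
  d2 <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums((x - matrix(centers[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
  })
  # assignment: index of the nearest centre for each point
  cluster <- max.col(-d2, ties.method = "first")
  # update: each new centre is the mean of the points assigned to it
  new_centers <- t(sapply(seq_len(nrow(centers)), function(j) {
    colMeans(x[cluster == j, , drop = FALSE])
  }))
  list(cluster = cluster, centers = new_centers)
}

centers <- x[c(1, 4), ]   # start from two of the observations
for (iter in 1:10) {
  step <- kmeans_step(x, centers)
  # stop once the centres no longer move
  if (all(abs(step$centers - centers) < 1e-12)) break
  centers <- step$centers
}
step$cluster
step$centers
```

In practice the starting centres are chosen at random, which is why repeated runs of `kmeans()` on the same data can give different partitions.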

I will perform k-means clustering following the example provided at http://www.statmethods.net/advstats/cluster.html and using the wine dataset, which I previously analysed using random forests.

data <- read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                   header=FALSE,
                   sep=',')

names(data) <- c('Class',
                 'Alcohol',
                 'Malic acid',
                 'Ash',
                 'Alcalinity of ash',
                 'Magnesium',
                 'Total phenols',
                 'Flavanoids',
                 'Nonflavanoid phenols',
                 'Proanthocyanins',
                 'Color intensity',
                 'Hue',
                 'OD280/OD315 of diluted wines',
                 'Proline')

#how many different classes?
table(data$Class)

 1  2  3 
59 71 48

rownames(data) <- paste(rownames(data), '_', data$Class, sep="")
data <- data[,-1]
head(data)
    Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols Flavanoids Nonflavanoid phenols
1_1   14.23       1.71 2.43              15.6       127          2.80       3.06                 0.28
2_1   13.20       1.78 2.14              11.2       100          2.65       2.76                 0.26
3_1   13.16       2.36 2.67              18.6       101          2.80       3.24                 0.30
4_1   14.37       1.95 2.50              16.8       113          3.85       3.49                 0.24
5_1   13.24       2.59 2.87              21.0       118          2.80       2.69                 0.39
6_1   14.20       1.76 2.45              15.2       112          3.27       3.39                 0.34
    Proanthocyanins Color intensity  Hue OD280/OD315 of diluted wines Proline
1_1            2.29            5.64 1.04                         3.92    1065
2_1            1.28            4.38 1.05                         3.40    1050
3_1            2.81            5.68 1.03                         3.17    1185
4_1            2.18            7.80 0.86                         3.45    1480
5_1            1.82            4.32 1.04                         2.93     735
6_1            1.97            6.75 1.05                         2.85    1450

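# total within-groups sum of squares for a single cluster (k = 1)
wss <- (nrow(data)-1)*sum(apply(data,2,var))
# within-groups sum of squares for k = 2 to 15; kmeans() picks random starting
# centres, so the curve varies slightly between runs
for (i in 2:15)
   wss[i] <- sum(kmeans(data, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
   ylab="Within groups sum of squares")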

[Figure: within-groups sum of squares plotted against the number of clusters, 1 to 15]

Five seems to be a good number of partitions based on the within-groups sum of squares.
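
Two caveats apply to this kind of plot. First, kmeans() starts from random centres, so the curve can change from run to run unless a seed is set. Second, the variables here are on very different scales (Proline runs from the hundreds into the thousands while Hue is around 1), so Proline dominates the Euclidean distances; the statmethods example, if I read it correctly, standardises the variables before clustering. The sketch below shows one way to address both. The helper name `wss_curve`, the seed, and the `nstart` value are my own assumptions, and the built-in USArrests data is only a stand-in so that the block runs without a download.

```r
# Hypothetical helper: total within-cluster sum of squares for k = 1..max_k
wss_curve <- function(x, max_k = 15, nstart = 25, seed = 1) {
  x <- scale(x)     # centre each column and divide by its standard deviation
  set.seed(seed)    # make the random starts reproducible
  sapply(seq_len(max_k), function(k) {
    # nstart random starts per k; kmeans() keeps the best of them
    kmeans(x, centers = k, nstart = nstart)$tot.withinss
  })
}

# Stand-in data so the block is self-contained;
# with the wine data frame the call would be wss_curve(data)
wss <- wss_curve(USArrests, max_k = 10)
plot(seq_along(wss), wss, type = "b",
     xlab = "Number of clusters", ylab = "Within groups sum of squares")
```

Standardising will generally change both the shape of the curve and the resulting clusters, so the numbers shown in this post would not carry over.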

fit <- kmeans(data, 5)
aggregate(data,by=list(fit$cluster),FUN=mean)
  Group.1  Alcohol Malic acid      Ash Alcalinity of ash Magnesium Total phenols Flavanoids
1       1 13.52750   1.925938 2.370938          17.72500 106.50000      2.725000   2.742500
2       2 12.35871   2.160000 2.240000          20.26774  90.77419      2.324516   2.114839
3       3 13.86000   1.793913 2.506957          17.07391 106.00000      2.943043   3.110870
4       4 12.94184   2.600204 2.385306          19.72245 103.10204      2.077959   1.531837
5       5 12.67860   2.758372 2.357907          21.29070  94.00000      1.854884   1.425116
  Nonflavanoid phenols Proanthocyanins Color intensity       Hue OD280/OD315 of diluted wines   Proline
1            0.2887500        1.875938        4.988750 1.0426875                     3.089062 1017.4375
2            0.3593548        1.571290        3.166129 1.0303226                     2.742581  383.4516
3            0.2986957        1.926087        6.260000 1.1000000                     3.035652 1338.5652
4            0.3908163        1.483878        5.714082 0.8728571                     2.342653  713.1020
5            0.4188372        1.335581        5.083256 0.8616279                     2.241860  529.6047
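
The per-cluster means computed with aggregate() are the cluster centroids, which the fitted object also stores in `fit$centers`. A small sketch on stand-in data (the built-in iris measurements, chosen only so the block runs on its own; the seed and `nstart` are arbitrary) checks that the two agree:

```r
set.seed(42)
x <- iris[, 1:4]     # stand-in data; with the wine data use `data`
fit <- kmeans(x, centers = 3, nstart = 10)

# Means of every variable within each cluster, computed by hand
by_hand <- aggregate(x, by = list(cluster = fit$cluster), FUN = mean)

# Drop the grouping column and compare with the centroids stored in the fit
same <- all.equal(as.matrix(by_hand[, -1]), fit$centers, check.attributes = FALSE)
same

fit$size             # number of observations in each cluster
```

`fit$size` is a quick way to see whether any cluster is very small, which is worth checking before reading too much into its means.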

data_fit <- data.frame(data, fit$cluster)

library(cluster)
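# plot the clusters on the first two principal components of the measurements;
# pass data rather than data_fit so the cluster label is not treated as a variable
clusplot(data, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, cex=0.4)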

[Figure: clusplot of the five k-means clusters]
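
As far as I can tell from the cluster package documentation, clusplot() reduces the data to two dimensions with principal components (princomp), so "Component 1" and "Component 2" on the axes are the first two principal components: each is a weighted combination of all the variables, not any two of the original columns. The sketch below shows how to inspect such components directly. It uses the built-in iris measurements as stand-in data and `cor = TRUE` as my own choice; it is not a reproduction of what clusplot() does internally.

```r
x <- iris[, 1:4]     # stand-in data; with the wine data use `data`
pc <- princomp(x, cor = TRUE, scores = TRUE)

# Loadings: the weight of every (standardised) variable in components 1 and 2
round(unclass(pc$loadings)[, 1:2], 2)

# Share of the total variance captured by each component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(sum(var_explained[1:2]), 3)

# Coordinates of the observations on the first two components
scores <- pc$scores[, 1:2]
head(scores)
```

Variables with large absolute loadings on a component are the ones that drive the separation along that axis.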

We know that there are three classes of wine. What would our results look like if we chose three clusters?

fit <- kmeans(data, 3)
data_fit <- data.frame(data, fit$cluster)

#how many of the class 1 wines were clustered into the same cluster?
table(data_fit[grep("_1", row.names(data_fit)),]$fit.cluster)

 1  2 
13 46

#how many of the class 2 wines were clustered into the same cluster?
table(data_fit[grep("_2", row.names(data_fit)),]$fit.cluster)

 1  2  3 
20  1 50

#how many of the class 3 wines were clustered into the same cluster?
table(data_fit[grep("_3", row.names(data_fit)),]$fit.cluster)

 1  3 
29 19
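
The three grep() calls can be collapsed into a single cross-tabulation of known class against assigned cluster. The sketch below uses the built-in iris data as a stand-in (Species plays the role of the wine class; the seed and `nstart` are arbitrary), and also shows how a class label could be recovered from row names of the form used in this post. Note that cluster numbers are arbitrary labels, so cluster 1 need not correspond to class 1.

```r
set.seed(1)
classes <- iris$Species    # stand-in for the wine class labels
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Rows are the known classes, columns are the k-means clusters
confusion <- table(class = classes, cluster = fit$cluster)
confusion

# Share of observations that fall in the largest cluster of their own class
agreement <- sum(apply(confusion, 1, max)) / sum(confusion)
agreement

# Recovering the class from row names such as "1_1" or "60_2"
rn <- c("1_1", "2_1", "60_2", "131_3")
recovered <- sub(".*_", "", rn)
recovered
```

With the wine data, the equivalent would be something like `table(sub(".*_", "", rownames(data_fit)), data_fit$fit.cluster)`.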

This work is licensed under a Creative Commons Attribution 4.0 International License.
2 comments
  1. Hi Davo,

    I'm getting into k-means clustering, and I can't find a good reference for what component 1 and component 2 in clusplot correspond to. I know they're from a PCA, but I can't figure out what they mean. Are they the principal components that cause the groups to split (and if that's the case, how do you know which two components they are), or are they the first two components in your data set (i.e. alcohol and malic acid)? Do you have a good reference to explain this? Thanks so much!
