Updated: 2014 March 13th
From Wikipedia:
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
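To make the "nearest mean" idea concrete, here is a minimal sketch of the classic assign-and-update loop (often called Lloyd's algorithm). The toy data, the starting centres and the variable names are all made up for illustration; R's kmeans() uses a different algorithm (Hartigan-Wong) by default, so this is not what runs in the rest of the post.

```r
# Toy data (assumption): two well-separated 2-D blobs of 20 points each
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
k <- 2
centers <- x[c(1, 21), , drop = FALSE]  # start with one point from each blob

for (iter in 1:20) {
  # squared distance from every observation to every centre (n x k matrix)
  d2 <- sapply(1:k, function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  # assignment step: each observation joins the cluster with the nearest mean
  assignment <- max.col(-d2, ties.method = "first")
  # update step: each centre becomes the mean of its current members
  new_centers <- t(sapply(1:k, function(j)
    colMeans(x[assignment == j, , drop = FALSE])))
  if (all(abs(new_centers - centers) < 1e-12)) break  # centres stopped moving
  centers <- new_centers
}

print(table(assignment))
print(centers)
```

The loop stops once the centres no longer move, which is the point at which every observation already sits in the cluster with the nearest mean.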
I will perform k-means clustering following the example provided at http://www.statmethods.net/advstats/cluster.html and using the wine dataset, which I previously analysed using random forests.
data <- read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                   header=F, sep=',')
names(data) <- c('Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
                 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
                 'Proanthocyanins', 'Color intensity', 'Hue',
                 'OD280/OD315 of diluted wines', 'Proline')

#how many different classes?
table(data$Class)

 1  2  3
59 71 48

rownames(data) <- paste(rownames(data), '_', data$Class, sep="")
data <- data[,-1]
head(data)

    Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols Flavanoids Nonflavanoid phenols
1_1   14.23       1.71 2.43              15.6       127          2.80       3.06                 0.28
2_1   13.20       1.78 2.14              11.2       100          2.65       2.76                 0.26
3_1   13.16       2.36 2.67              18.6       101          2.80       3.24                 0.30
4_1   14.37       1.95 2.50              16.8       113          3.85       3.49                 0.24
5_1   13.24       2.59 2.87              21.0       118          2.80       2.69                 0.39
6_1   14.20       1.76 2.45              15.2       112          3.27       3.39                 0.34
    Proanthocyanins Color intensity  Hue OD280/OD315 of diluted wines Proline
1_1            2.29            5.64 1.04                         3.92    1065
2_1            1.28            4.38 1.05                         3.40    1050
3_1            2.81            5.68 1.03                         3.17    1185
4_1            2.18            7.80 0.86                         3.45    1480
5_1            1.82            4.32 1.04                         2.93     735
6_1            1.97            6.75 1.05                         2.85    1450

wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(data, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
Five seems to be a good number of partitions based on the within groups sum of squares.
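One caveat: kmeans() starts from randomly chosen centres, so the curve (and the apparent elbow) can shift a little from run to run. Below is a minimal sketch of a more repeatable version of the same plot. It uses the built-in iris measurements as a stand-in so that it runs without a download; the seed, nstart = 25 and iter.max = 50 are arbitrary choices of mine and not part of the original analysis.

```r
# Stand-in data (assumption): the four numeric iris columns, built into R
x <- iris[, 1:4]
max_k <- 10

set.seed(31)  # arbitrary seed, only so that reruns give the same curve
wss <- numeric(max_k)
wss[1] <- (nrow(x) - 1) * sum(apply(x, 2, var))  # k = 1: total sum of squares
for (k in 2:max_k) {
  # nstart = 25 keeps the best of 25 random starts for each k
  wss[k] <- kmeans(x, centers = k, nstart = 25, iter.max = 50)$tot.withinss
}

plot(1:max_k, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

Here tot.withinss is the same quantity as sum(...$withinss); swapping x for the wine data frame should give a steadier version of the plot above.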
fit <- kmeans(data, 5)
aggregate(data,by=list(fit$cluster),FUN=mean)

  Group.1  Alcohol Malic acid      Ash Alcalinity of ash Magnesium Total phenols Flavanoids
1       1 13.52750   1.925938 2.370938          17.72500 106.50000      2.725000   2.742500
2       2 12.35871   2.160000 2.240000          20.26774  90.77419      2.324516   2.114839
3       3 13.86000   1.793913 2.506957          17.07391 106.00000      2.943043   3.110870
4       4 12.94184   2.600204 2.385306          19.72245 103.10204      2.077959   1.531837
5       5 12.67860   2.758372 2.357907          21.29070  94.00000      1.854884   1.425116
  Nonflavanoid phenols Proanthocyanins Color intensity       Hue OD280/OD315 of diluted wines   Proline
1            0.2887500        1.875938        4.988750 1.0426875                     3.089062 1017.4375
2            0.3593548        1.571290        3.166129 1.0303226                     2.742581  383.4516
3            0.2986957        1.926087        6.260000 1.1000000                     3.035652 1338.5652
4            0.3908163        1.483878        5.714082 0.8728571                     2.342653  713.1020
5            0.4188372        1.335581        5.083256 0.8616279                     2.241860  529.6047

data_fit <- data.frame(data, fit$cluster)
library(cluster)
clusplot(data_fit, fit$cluster, color=T, shade=T, labels=2, lines=0, cex=0.4)
We know that there are three classes of wine. What would our results look like if we chose three clusters?
fit <- kmeans(data, 3)
data_fit <- data.frame(data, fit$cluster)

#how many of the class 1 wines were clustered into the same cluster?
table(data_fit[grep("_1", row.names(data_fit)),]$fit.cluster)

 1  2
13 46

#how many of the class 2 wines were clustered into the same cluster?
table(data_fit[grep("_2", row.names(data_fit)),]$fit.cluster)

 1  2  3
20  1 50

#how many of the class 3 wines were clustered into the same cluster?
table(data_fit[grep("_3", row.names(data_fit)),]$fit.cluster)

 1  3
29 19
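The three tables above can also be collapsed into a single cross-tabulation of known class against assigned cluster. This is a minimal sketch that uses iris as a stand-in (species playing the role of wine class) so that it runs on its own; the seed and nstart value are arbitrary assumptions of mine.

```r
# Stand-in data (assumption): iris measurements, with Species as the known class
set.seed(42)  # arbitrary seed for repeatable cluster labels
known <- iris$Species
fit3 <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# rows = known class, columns = cluster assigned by k-means
confusion <- table(Class = known, Cluster = fit3$cluster)
print(confusion)

# For the wine data in this post, the class can be recovered from the row names:
# table(Class = sub(".*_", "", rownames(data)), Cluster = fit$cluster)
```

Each row shows how one class was spread across the clusters. Keep in mind that the cluster numbers themselves are arbitrary and can change between runs.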
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi Davo –
I’m getting into k-means clustering, and I can’t find a good reference for what component 1 and component 2 in clusplot correspond to. I know they’re from a PCA, but I can’t figure out what they mean. Are they the principal components that cause the groups to split (and if that’s the case, how do you know which two components they are), or are they the first two components in your data set (i.e. alcohol and malic acid)? Do you have a good reference to explain this? Thanks so much!
Hi Cait,
If you type:
library(cluster)
?clusplot.default
there will be more information as well as some references (for example, http://www.sciencedirect.com/science/article/pii/S0167947398001029).
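As far as I can tell from that help page, the two components are not individual variables such as alcohol and malic acid. They are the first two principal components, each a weighted combination of all the variables. Here is a minimal sketch of how to look at them, using iris as a stand-in data set and assuming a correlation-based PCA via princomp(), which is my reading of what clusplot does for raw data:

```r
# Stand-in data (assumption): the four numeric iris columns
x <- iris[, 1:4]

# Correlation-based PCA; cor = TRUE is my assumption about clusplot's behaviour
pca <- princomp(x, cor = TRUE)

# Coordinates of each observation on the first two components
scores <- pca$scores[, 1:2]

# Share of the total variance captured by each component
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(explained[1:2], 3))

# Loadings: how much each original variable contributes to components 1 and 2
print(round(unclass(pca$loadings)[, 1:2], 3))
```

The loadings show which of the original variables drive each axis of the plot.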
Cheers,
Dave