Explaining PCA to a school child

Ed Yong asked on Twitter “Explain principal component analysis to a schoolchild in a tweet.” Since I can’t explain PCA eloquently myself, I found this interesting and wanted to keep a record of the replies for future reference. Here are some of the replies, lightly edited, with my favourite first (and the rest in no particular order):

• Something nerds do to show off and that the rest of us secretly don't understand but always look knowingly when it's mentioned. @richboden
• Compare the height, eye color, hair length, etc. of the people around you. Where do you notice the most difference? @holtchesley
• It's a way to take any clues or observations you have about something and use math to see how much you can figure out about it. @ShipLives
• Capturing the essence of a data set in a reduced form. Think of describing the premiership in terms of Man U, Chelsea and Arsenal @fastconvergence
• Imagine lining the crayons in a box up based on their color similarity. Now, in the other direction, separate them by the wear. @APV2600
• "If you have to talk about something really complicated but you can only use two or three numbers, what do you do?" @blakestacey
• PCA is like picking cookies with the most chocolate chips and seeing where they came from in the jar. @holy_kau
• If you have a jellybean-shaped cloud of points, PCA figures out long side and short side of cloud and squishes it to get a ball. @geomblog
• PCA is a way of finding basic similarities between many different things - linking complexities by their simplest similarities @aimeemax
• PCA is a math method that converts data into a map. Objects (countries) that are closeby have similar parameters (coordinates) @scimomof2
• It's like representing a road with a line on a map. The road has width and length, but the length is sufficient for most purposes @CaldenWloka
• PCA is like taking all your responses to 'what is PCA' and condensing their essence into a single retweet. @ErikJCox
• A colleague once summed up PCA as a way of letting data decide for itself which patterns best describe it. @J_Liptak
• PCA tells you which features best describe the differences between items in a group, like colour in a box of lollies @pickleswarlz
• PCA of leaf shape would first give size, then fatness (width:length), wiggles of leaf edge, etc. 1st parts give largest variation @mickresearch
• Imagine collecting all the rocks in the yard, but you can only keep 10. PCA picks the ones with the most common color & sizes @kristenobacter
• Best soccer teams: do they score more or have the best defense ? PCA ranks values of teams based on multiple data of this kind. @tomroud

PCA and rggobi

#generate row names
rname <- 1:10000
#generate column names: 20 control samples followed by 20 test samples
cname <- c(rep("control",20),rep("test",20))
#initialise array with 10,000 rows and 40 columns
data <- array(0,dim=c(10000,40),dimnames=list(rname,cname))
#fill each row with 20 control and 20 test values drawn from normal
#distributions with randomly chosen means and standard deviations
for (i in 1:nrow(data)){
   x_mean <- sample(1:10000,1)
   x_sd <- sample(1:1000,1)
   y_mean <- sample(1:10000,1)
   y_sd <- sample(1:1000,1)
   x <- rnorm(20,mean=x_mean,sd=x_sd)
   y <- rnorm(20,mean=y_mean,sd=y_sd)
   data[i,1:20] <- x
   data[i,21:40] <- y
}
#PCA on the simulated data; the samples are the columns here, so each sample
#gets a loading for each principal component in the rotation matrix
data.pca <- prcomp(data)
x <- data.pca$rotation[,"PC1"]
y <- data.pca$rotation[,"PC2"]
xy <- data.frame(x,y)
library(rggobi)
g <- ggobi(xy)


I labelled only two samples, since labelling all of them would make the text overlap. Samples 1 to 20 cluster near the labelled sample 16 and samples 21 to 40 near the labelled sample 37, as expected.
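If rggobi isn't available, roughly the same picture can be drawn in base R (a quick sketch using the xy data frame from above, labelling only samples 16 and 37 to match the plot described above):

#plot the PC1 and PC2 loadings of the 40 samples
plot(xy$x, xy$y, xlab="PC1", ylab="PC2")
#label samples 16 and 37 only, to avoid overlapping text
text(xy$x[c(16,37)], xy$y[c(16,37)], labels=c("16","37"), pos=3)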

Step by step principal components analysis using R

I've always wondered about the calculations behind a principal components analysis (PCA). An extremely useful tutorial explains the key concepts and runs through the analysis. Here I use R to calculate each step of a PCA in hopes of better understanding the analysis.

x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)
mean(x)
#[1] 1.81
mean(y)
#[1] 1.91
x_less_mean <- x - mean(x)
x_less_mean
# [1]  0.69 -1.31  0.39  0.09  1.29  0.49  0.19 -0.81 -0.31 -0.71
y_less_mean <- y - mean(y)
y_less_mean
# [1]  0.49 -1.21  0.99  0.29  1.09  0.79 -0.31 -0.81 -0.31 -1.01
cov(x,y)
#[1] 0.6154444
cov(x,x)
#[1] 0.6165556
cov(y,y)
#[1] 0.7165556
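#aside (not from the tutorial): the covariance of a variable with itself is just its variance
var(x)
#[1] 0.6165556
var(y)
#[1] 0.7165556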
covar_matrix <- matrix(c(cov(x,x),cov(x,y),cov(y,x),cov(y,y)),nrow=2,ncol=2,byrow=TRUE,dimnames=list(c("x","y"),c("x","y")))
covar_matrix
#          x         y
#x 0.6165556 0.6154444
#y 0.6154444 0.7165556
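#quick check (not from the tutorial): cov() on the two columns returns the same matrix directly
cov(cbind(x,y))
#          x         y
#x 0.6165556 0.6154444
#y 0.6154444 0.7165556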
eigen(covar_matrix)
#$values
#[1] 1.2840277 0.0490834
#
#$vectors
#          [,1]       [,2]
#[1,] 0.6778734 -0.7351787
#[2,] 0.7351787  0.6778734
e <- eigen(covar_matrix)
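#sanity check (not from the tutorial): the covariance matrix times an eigenvector
#should equal that eigenvector scaled by its eigenvalue
covar_matrix %*% e$vectors[,1]
e$values[1] * e$vectors[,1]
#both give the same numbers (up to floating point error)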
#to get the first principal component scores,
#multiply the mean-centred x and y values by the first eigenvector
for (i in 1:length(x_less_mean)){
   pc1 <- (x_less_mean[i] * e$vectors[1,1]) + (y_less_mean[i] * e$vectors[2,1])
   print(pc1)
}
#[1] 0.8279702
#[1] -1.77758
#[1] 0.9921975
#[1] 0.2742104
#[1] 1.675801
#[1] 0.9129491
#[1] -0.09910944
#[1] -1.144572
#[1] -0.4380461
#[1] -1.223821
#likewise, multiply by the second eigenvector to get the second principal component scores
for (i in 1:length(x_less_mean)){
   pc2 <- (x_less_mean[i] * e$vectors[1,2]) + (y_less_mean[i] * e$vectors[2,2])
   print(pc2)
}
#[1] -0.1751153
#[1] 0.1428572
#[1] 0.384375
#[1] 0.1304172
#[1] -0.2094985
#[1] 0.1752824
#[1] -0.3498247
#[1] 0.04641726
#[1] 0.01776463
#[1] -0.1626753
#store both sets of scores in a 10 x 2 array
z <- array(0, dim=c(10,2))
for (i in 1:length(x_less_mean)){
   pc1 <- (x_less_mean[i] * e$vectors[1,1]) + (y_less_mean[i] * e$vectors[2,1])
   pc2 <- (x_less_mean[i] * e$vectors[1,2]) + (y_less_mean[i] * e$vectors[2,2])
   z[i,1] <- pc1
   z[i,2] <- pc2
}
z
#             [,1]        [,2]
# [1,]  0.82797019 -0.17511531
# [2,] -1.77758033  0.14285723
# [3,]  0.99219749  0.38437499
# [4,]  0.27421042  0.13041721
# [5,]  1.67580142 -0.20949846
# [6,]  0.91294910  0.17528244
# [7,] -0.09910944 -0.34982470
# [8,] -1.14457216  0.04641726
# [9,] -0.43804614  0.01776463
#[10,] -1.22382056 -0.16267529
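#the two loops above amount to a single matrix multiplication:
#the mean-centred data multiplied by the matrix of eigenvectors
z_alt <- cbind(x_less_mean, y_less_mean) %*% e$vectors
all.equal(unname(z_alt), z)
#should return TRUE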
#now repeat the analysis using the built-in prcomp() function
data <- data.frame(x,y)
data.pca <- prcomp(data)
data.pca
#Standard deviations:
#[1] 1.1331495 0.2215477
#
#Rotation:
#         PC1        PC2
#x -0.6778734  0.7351787
#y -0.7351787 -0.6778734
data.pca$x
#              PC1         PC2
# [1,] -0.82797019  0.17511531
# [2,]  1.77758033 -0.14285723
# [3,] -0.99219749 -0.38437499
# [4,] -0.27421042 -0.13041721
# [5,] -1.67580142  0.20949846
# [6,] -0.91294910 -0.17528244
# [7,]  0.09910944  0.34982470
# [8,]  1.14457216 -0.04641726
# [9,]  0.43804614 -0.01776463
#[10,]  1.22382056  0.16267529
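#the squared standard deviations reported by prcomp() correspond to the eigenvalues calculated earlier
data.pca$sdev^2
e$values
#both should print 1.2840277 0.0490834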
#plot the manually calculated scores and the prcomp() scores side by side
op <- par(mfrow=c(1,2))
plot(z)
plot(data.pca$x)
#reset par()
par(mfrow=c(1,1))
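#summary() reports the proportion of total variance explained by each component;
#PC1 should account for the vast majority here
summary(data.pca)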

I'm still trying to figure out why the signs are inverted between the principal components calculated manually and those from prcomp(). The relationships between the points don't change, but I was curious as to why this happens. Interestingly, in the tutorial mentioned at the start of this post, the first principal component scores are identical to mine, but the second principal component scores are inverted.
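One thing worth noting: eigenvectors are only defined up to sign, and prcomp() uses singular value decomposition internally rather than eigen(), so the two approaches can legitimately return vectors pointing in opposite directions; the points are simply reflected and their relative positions are unchanged. As a quick check (not part of the tutorial), the two sets of scores above differ only in sign:

#flipping the sign of the manually calculated scores reproduces the prcomp() scores
all.equal(unname(data.pca$x), -z)
#should return TRUE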