Step by step Principal Component Analysis using R

I’ve always wondered what goes on behind the scenes of a Principal Component Analysis (PCA). I found this extremely useful tutorial that explains the key concepts of PCA and shows the step by step calculations. Here, I use R to perform each step of a PCA as per the tutorial.

# use a simple two dimensional dataset to illustrate PCA
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

plot(x, y, pch = 19)

Our dataset visualised on the x-y coordinates.

Next we need to work out the mean of each dimension and subtract it from each value from the respective dimensions. This is known as standardisation, where the dimensions now have a mean of zero.

mean(x)
# [1] 1.81
mean(y)
# [1] 1.91

x1 <- x - mean(x)
x1
# [1]  0.69 -1.31  0.39  0.09  1.29  0.49  0.19 -0.81 -0.31 -0.71
summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -1.310  -0.610   0.140   0.000   0.465   1.290

y1 <- y - mean(y)
y1
# [1]  0.49 -1.21  0.99  0.29  1.09  0.79 -0.31 -0.81 -0.31 -1.01
summary(y1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -1.210  -0.685  -0.010   0.000   0.715   1.090

plot(x1, y1, pch = 19)

Our standardised dataset visualised on the x-y coordinates.

The next step is to calculate the covariance matrix. Covariance measures how dimensions vary with respect to each other and the covariance matrix contains all covariance measures between all dimensions.

cov(x1, y1)
#[1] 0.6154444

cov(x1, x1)
#[1] 0.6165556

cov(y1, y1)
#[1] 0.7165556

m <- matrix(c(cov(x1, x1), cov(x1, y1), cov(y1, x1),cov(y1, y1)),
            nrow=2,
            ncol=2,
            byrow=TRUE,
            dimnames=list(c("x","y"),c("x","y")))

m
#           x         y
# x 0.6165556 0.6154444
# y 0.6154444 0.7165556

Next we need to find the eigenvector and eigenvalues of the covariance matrix. An eigenvector is a direction and an eigenvalue is a number that indicates how much variance is in the data in that direction.

e <- eigen(m)
e
# eigen() decomposition
# $values
# [1] 1.2840277 0.0490834
# 
# $vectors
#           [,1]       [,2]
# [1,] 0.6778734 -0.7351787
# [2,] 0.7351787  0.6778734

The largest eigenvalue is the first principal component; we multiply the standardised values to the first eigenvector, which is stored in e$vectors[,1].

pc1 <- x1 * e$vectors[1,1] + y1 * e$vectors[2,1]
pc1
#  [1]  0.82797019 -1.77758033  0.99219749  0.27421042  1.67580142  0.91294910 -0.09910944 -1.14457216 -0.43804614
# [10] -1.22382056

pc2 <- x1 * e$vectors[1,2] + y1 * e$vectors[2,2]
pc2
#  [1] -0.17511531  0.14285723  0.38437499  0.13041721 -0.20949846  0.17528244 -0.34982470  0.04641726  0.01776463
# [10] -0.16267529

data.frame(PC1 = pc1, PC2 = pc2)
#            PC1         PC2
# 1   0.82797019 -0.17511531
# 2  -1.77758033  0.14285723
# 3   0.99219749  0.38437499
# 4   0.27421042  0.13041721
# 5   1.67580142 -0.20949846
# 6   0.91294910  0.17528244
# 7  -0.09910944 -0.34982470
# 8  -1.14457216  0.04641726
# 9  -0.43804614  0.01776463
# 10 -1.22382056 -0.16267529

plot(pc1, pc2, pch = 19)

Our standardised dataset visualised on the first and second eigenvectors.

Now to perform PCA using the prcomp() function.

data <- data.frame(x,y)
data.pca <- prcomp(data)
data.pca
# Standard deviations (1, .., p=2):
# [1] 1.1331495 0.2215477
# 
# Rotation (n x k) = (2 x 2):
#          PC1        PC2
# x -0.6778734  0.7351787
# y -0.7351787 -0.6778734

names(data.pca)
# [1] "sdev"     "rotation" "center"   "scale"    "x"

data.pca$x
#              PC1         PC2
# [1,] -0.82797019  0.17511531
# [2,]  1.77758033 -0.14285723
# [3,] -0.99219749 -0.38437499
# [4,] -0.27421042 -0.13041721
# [5,] -1.67580142  0.20949846
# [6,] -0.91294910 -0.17528244
# [7,]  0.09910944  0.34982470
# [8,]  1.14457216 -0.04641726
# [9,]  0.43804614 -0.01776463
#[10,]  1.22382056  0.16267529

plot(data.pca$x[,1], data.pca$x[,2], pch = 19)

Our standardised dataset visualised on the first and second eigenvectors.

I’m unsure why the eigen() function returns eigenvectors that have inverted signs with respect to the eigenvectors returned by prcomp().

eigen(m)
# eigen() decomposition
# $values
# [1] 1.2840277 0.0490834
# 
# $vectors
#           [,1]       [,2]
# [1,] 0.6778734 -0.7351787
# [2,] 0.7351787  0.6778734

data.pca
# Standard deviations (1, .., p=2):
# [1] 1.1331495 0.2215477
# 
# Rotation (n x k) = (2 x 2):
#          PC1        PC2
# x -0.6778734  0.7351787
# y -0.7351787 -0.6778734

What is even weirder is that the eigenvectors calculated in the tutorial are a combination of values calculated from prcomp() and eigen().

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
7 comments Add yours
  1. The sign is meaningless here. It’s just chosen at random at the beginning . YOu are absolutely right, it does not change the relationship.

  2. Hi guys
    Thank you for four tutorial and comment,
    use correlation matrix instead of co-variance matrix.
    by default, R use correlation matrix in PCA computation.
    I checked it, the results are exactly as prncomp result.

    Thank you

  3. Hi again
    I’m sorry, I make a mistake in previous comment.
    Let me check again and I will send you the correction.
    Thank you.

  4. Thanks for the wonderful post.
    I was actually reading the tutotial by Lindsay,but I wanted to implement it in R.
    I was fortunate to find your post as you have used the same data used by the tutorial. Helped me a lot.

  5. Hi,

    Thank you for this simple meaningful tutorial. I have several simple questions.

    Does PCA mean transform existing data frame into new data frame?
    After new data frame constructed using PCA, you will need to choose n first columns as features right? Does it mean n first columns always be the first n important features?

    Please advise

    1. Does PCA mean transform existing data frame into new data frame?

      If you use prcomp(), the saved object is a prcomp object.

      x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
      y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)
      my_data <- data.frame(x,y)
      data.pca <- prcomp(my_data)
      class(data.pca)
      [1] "prcomp"
      

      After new data frame constructed using PCA, you will need to choose n first columns as features right?

      Typically people choose the PCs that explain the most variance.

      summary(data.pca)
      Importance of components%s:
                                PC1     PC2
      Standard deviation     1.1331 0.22155
      Proportion of Variance 0.9632 0.03682
      Cumulative Proportion  0.9632 1.00000
      

      The first PC explains 96% of the variance.

      Does it mean n first columns always be the first n important features?

      Yes.

      head(data.pca$x)
                  PC1        PC2
      [1,] -0.8279702  0.1751153
      [2,]  1.7775803 -0.1428572
      [3,] -0.9921975 -0.3843750
      [4,] -0.2742104 -0.1304172
      [5,] -1.6758014  0.2094985
      [6,] -0.9129491 -0.1752824
      

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.