# Step by step Principal Component Analysis using R

I’ve always wondered what goes on behind the scenes of a Principal Component Analysis (PCA). I found this extremely useful tutorial that explains the key concepts of PCA and shows the step-by-step calculations. Here, I use R to perform each step of a PCA as per the tutorial.

```r
# use a simple two-dimensional dataset to illustrate PCA
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

plot(x, y, pch = 19)
```

Our dataset visualised on the x-y coordinates.

Next we need to work out the mean of each dimension and subtract it from every value in that dimension. This is mean-centring, which gives each dimension a mean of zero. (The tutorial calls this standardisation; strictly speaking, standardisation also divides by the standard deviation, but only centring is needed here.)

```r
mean(x)
#  1.81
mean(y)
#  1.91

x1 <- x - mean(x)
x1
#   0.69 -1.31  0.39  0.09  1.29  0.49  0.19 -0.81 -0.31 -0.71
summary(x1)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# -1.310  -0.610   0.140   0.000   0.465   1.290

y1 <- y - mean(y)
y1
#   0.49 -1.21  0.99  0.29  1.09  0.79 -0.31 -0.81 -0.31 -1.01
summary(y1)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# -1.210  -0.685  -0.010   0.000   0.715   1.090

plot(x1, y1, pch = 19)
```

Our mean-centred dataset visualised on the x-y coordinates.
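As a quick sketch of the same centring step, R's built-in scale() function (with rescaling switched off) subtracts each column mean in one call:

```r
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

# scale() with scale = FALSE subtracts each column mean without rescaling
centred <- scale(cbind(x, y), center = TRUE, scale = FALSE)

# both column means are now (numerically) zero
colMeans(centred)
```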

The next step is to calculate the covariance matrix. Covariance measures how two dimensions vary together, and the covariance matrix collects the covariances between all pairs of dimensions.

```r
cov(x1, y1)
# 0.6154444

cov(x1, x1)
# 0.6165556

cov(y1, y1)
# 0.7165556

m <- matrix(c(cov(x1, x1), cov(x1, y1),
              cov(y1, x1), cov(y1, y1)),
            nrow = 2,
            ncol = 2,
            byrow = TRUE,
            dimnames = list(c("x", "y"), c("x", "y")))

m
#           x         y
# x 0.6165556 0.6154444
# y 0.6154444 0.7165556
```
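For comparison, cov() applied to a data frame returns the whole covariance matrix in one call; and since covariance is unaffected by mean-centring, the raw data gives the same matrix as the centred data:

```r
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

# the full 2 x 2 covariance matrix in one call
cov(data.frame(x, y))

# centring first makes no difference to the covariances
all.equal(cov(data.frame(x, y)),
          cov(data.frame(x = x - mean(x), y = y - mean(y))))
```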

Next we need to find the eigenvectors and eigenvalues of the covariance matrix. An eigenvector gives a direction, and its eigenvalue indicates how much of the data's variance lies along that direction.

```r
e <- eigen(m)
e
# eigen() decomposition
# $values
#  1.2840277 0.0490834
#
# $vectors
#           [,1]       [,2]
# [1,] 0.6778734 -0.7351787
# [2,] 0.7351787  0.6778734
```
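As a sanity check, each eigenpair should satisfy the defining property that multiplying by the matrix stretches the eigenvector by its eigenvalue, i.e. m %*% v = lambda * v (rebuilding m from the covariances above):

```r
# rebuild the covariance matrix and its eigendecomposition
m <- matrix(c(0.6165556, 0.6154444,
              0.6154444, 0.7165556),
            nrow = 2, byrow = TRUE)
e <- eigen(m)

# multiplying by m stretches each eigenvector by its eigenvalue
v1 <- e$vectors[, 1]
all.equal(as.vector(m %*% v1), e$values[1] * v1)
```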

The eigenvector with the largest eigenvalue is the first principal component; to project the data onto it, we multiply the centred values by the first eigenvector, which is stored in e$vectors[,1].

```r
pc1 <- x1 * e$vectors[1,1] + y1 * e$vectors[2,1]
pc1
#    0.82797019 -1.77758033  0.99219749  0.27421042  1.67580142  0.91294910 -0.09910944 -1.14457216 -0.43804614
#  -1.22382056

pc2 <- x1 * e$vectors[1,2] + y1 * e$vectors[2,2]
pc2
#   -0.17511531  0.14285723  0.38437499  0.13041721 -0.20949846  0.17528244 -0.34982470  0.04641726  0.01776463
#  -0.16267529

data.frame(PC1 = pc1, PC2 = pc2)
#            PC1         PC2
# 1   0.82797019 -0.17511531
# 2  -1.77758033  0.14285723
# 3   0.99219749  0.38437499
# 4   0.27421042  0.13041721
# 5   1.67580142 -0.20949846
# 6   0.91294910  0.17528244
# 7  -0.09910944 -0.34982470
# 8  -1.14457216  0.04641726
# 9  -0.43804614  0.01776463
# 10 -1.22382056 -0.16267529

plot(pc1, pc2, pch = 19)
```

Our mean-centred dataset projected onto the first and second eigenvectors.
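The element-wise sums above are a matrix multiplication in disguise; a sketch of the same projection done for all points at once (recomputing the centred data and eigenvectors):

```r
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

# centre the data and take the eigenvectors of its covariance matrix
centred <- scale(cbind(x, y), scale = FALSE)
e <- eigen(cov(centred))

# one matrix product gives both principal component scores for every point
scores <- centred %*% e$vectors
head(scores)
```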

Now to perform PCA using the prcomp() function.

```r
data <- data.frame(x, y)
data.pca <- prcomp(data)
data.pca
# Standard deviations (1, .., p=2):
#  1.1331495 0.2215477
#
# Rotation (n x k) = (2 x 2):
#          PC1        PC2
# x -0.6778734  0.7351787
# y -0.7351787 -0.6778734

names(data.pca)
#  "sdev"     "rotation" "center"   "scale"    "x"

data.pca$x
#              PC1         PC2
# [1,] -0.82797019  0.17511531
# [2,]  1.77758033 -0.14285723
# [3,] -0.99219749 -0.38437499
# [4,] -0.27421042 -0.13041721
# [5,] -1.67580142  0.20949846
# [6,] -0.91294910 -0.17528244
# [7,]  0.09910944  0.34982470
# [8,]  1.14457216 -0.04641726
# [9,]  0.43804614 -0.01776463
#[10,]  1.22382056  0.16267529

plot(data.pca$x[,1], data.pca$x[,2], pch = 19)
```

Our data projected onto the first and second principal components as computed by prcomp().
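The two approaches can be tied together numerically: squaring the standard deviations reported by prcomp() should recover the eigenvalues of the covariance matrix (a quick check using the same data):

```r
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

data.pca <- prcomp(data.frame(x, y))
e <- eigen(cov(data.frame(x, y)))

# the variances along the principal components are the eigenvalues
all.equal(data.pca$sdev^2, e$values)
```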

I’m unsure why the eigen() function returns eigenvectors whose signs are flipped with respect to the eigenvectors returned by prcomp().

```r
eigen(m)
# eigen() decomposition
# $values
#  1.2840277 0.0490834
#
# $vectors
#           [,1]       [,2]
# [1,] 0.6778734 -0.7351787
# [2,] 0.7351787  0.6778734

data.pca
# Standard deviations (1, .., p=2):
#  1.1331495 0.2215477
#
# Rotation (n x k) = (2 x 2):
#          PC1        PC2
# x -0.6778734  0.7351787
# y -0.7351787 -0.6778734
```
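A small check confirms that the two decompositions agree up to sign: taking absolute values makes the rotation matrix from prcomp() identical to the eigenvector matrix from eigen() (sketch recomputing both from the data above):

```r
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

data.pca <- prcomp(data.frame(x, y))
e <- eigen(cov(data.frame(x, y)))

# ignoring sign, prcomp()'s rotation matrix matches eigen()'s eigenvectors
all.equal(abs(unname(data.pca$rotation)), abs(e$vectors))
```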

What is even weirder is that the eigenvectors calculated in the tutorial are a combination of values calculated from prcomp() and eigen().

## Comments

1. Stephan HOURDEZ says:

The sign is meaningless here; it's just chosen arbitrarily at the start. You are absolutely right, it does not change the relationship.

1. Davo says:

Hi Stephan,

Thanks for the comment and the affirmation 🙂

Cheers,

Dave

2. Hossein says:

Hi guys,
Thank you for your tutorial and the comments.
Use the correlation matrix instead of the covariance matrix;
by default, R uses the correlation matrix in the PCA computation.
I checked it, and the results are exactly the same as the prcomp() result.

Thank you

3. Hossein says:

Hi again
I’m sorry, I made a mistake in my previous comment.
Let me check again and I will send you the correction.
Thank you.

4. Arunava Datta says:

Thanks for the wonderful post.
I was actually reading the tutorial by Lindsay, but I wanted to implement it in R.
I was fortunate to find your post, as you have used the same data as the tutorial. It helped me a lot.

5. Chandra Sutrisno Tjhong says:

Hi,

Thank you for this simple meaningful tutorial. I have several simple questions.

Does PCA mean transforming an existing data frame into a new data frame?
After the new data frame is constructed using PCA, you need to choose the first n columns as features, right? Does that mean the first n columns are always the n most important features?

1. Davo says:

> Does PCA mean transforming an existing data frame into a new data frame?

If you use prcomp(), the saved object is a prcomp object; the transformed data is stored in its x component.

```r
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)
my_data <- data.frame(x, y)
data.pca <- prcomp(my_data)
class(data.pca)
# "prcomp"
```

> After the new data frame is constructed using PCA, you need to choose the first n columns as features, right?

Typically people choose the PCs that explain the most variance.

```r
summary(data.pca)
# Importance of components:
#                           PC1     PC2
# Standard deviation     1.1331 0.22155
# Proportion of Variance 0.9632 0.03682
# Cumulative Proportion  0.9632 1.00000
```

The first PC explains 96% of the variance.
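Those proportions can be reproduced by hand from the standard deviations: each PC's share of the variance is its squared sdev over the total.

```r
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

data.pca <- prcomp(data.frame(x, y))

# proportion of variance explained by each principal component
prop <- data.pca$sdev^2 / sum(data.pca$sdev^2)
round(prop, 4)
```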

> Does that mean the first n columns are always the n most important features?

Yes; the columns of data.pca$x are ordered by decreasing variance explained.

```r
head(data.pca$x)
#             PC1        PC2
# [1,] -0.8279702  0.1751153
# [2,]  1.7775803 -0.1428572
# [3,] -0.9921975 -0.3843750
# [4,] -0.2742104 -0.1304172
# [5,] -1.6758014  0.2094985
# [6,] -0.9129491 -0.1752824
```
