Principal Component Analysis

From Dave's wiki

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.[1]

PCA finds the principal components, the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out.[2]

PCA reduces the dimensionality of a data set by identifying directions, called principal components, along which the variation in the data is maximal. The principal components are new variables formed as linear combinations of the original variables.[3]

Correlation between variables indicates redundancy in the data, so we can simplify the data by replacing a group of correlated variables with a single new variable. The principal components that PCA creates are such variables: each one is a linear combination of the original variables.

[4] [5] [6] [7] [8] [9] [10] [11] [12]


Imagine conducting an experiment and measuring various variables that result from it to help understand the dynamics of the system.

Why may it be possible to reduce dimensions?

When there are correlations (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect correlation. The data points will fall close to a straight line, and that line could be used as a new (one-dimensional) axis to represent the variation among data points. As another example, suppose that we have verbal, math, and total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the data because total = verbal + math, meaning the third variable is completely determined by the first two. The reason for saying "at most" two dimensions is that if there is a strong correlation between verbal and math, then it may be possible that there is only one true dimension to the data. See
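
The SAT example can be checked numerically. The sketch below uses simulated scores (the means, standard deviations and sample size are illustrative assumptions, not real SAT data) and counts how many eigenvalues of the covariance matrix are meaningfully larger than zero.

```r
# Simulated scores; all numbers here are made up for illustration.
set.seed(1)
verbal <- round(rnorm(100, mean = 500, sd = 100))
math   <- round(rnorm(100, mean = 500, sd = 100))
sat    <- data.frame(verbal = verbal, math = math, total = verbal + math)

# One eigenvalue per variable, sorted from largest to smallest.
sat_eigenvalues <- eigen(cov(sat))$values

# Count eigenvalues that are not zero up to floating-point error.
# total is fully determined by verbal and math, so only two remain.
sat_rank <- sum(sat_eigenvalues > 1e-8 * sat_eigenvalues[1])
print(sat_rank)
```

If verbal and math were also strongly correlated, the second eigenvalue would shrink towards zero as well, which is the "at most two dimensions" case described above.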

Background mathematics

The following background mathematics is necessary for understanding PCA.[13]

  • Standard deviation
  • Covariance
  • Eigenvectors
  • Eigenvalues

Standard deviation and variance

The Standard Deviation (SD) of a data set is a measure of how spread out the data is. The English definition of the SD is: “The average distance from the mean of the data set to a point”. The way to calculate it is to compute the squares of the distance from each data point to the mean of the set, add them all up, divide by n-1, and take the positive square root. Variance is another measure of the spread of data in a data set and is simply the SD squared.
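
As a minimal sketch of the calculation just described, the steps can be written out by hand and compared with R's built-in var() and sd(). The heights from R's built-in women data set are used here only as convenient example data.

```r
x <- women$height   # 15 heights from the built-in women data set
n <- length(x)

# Squared distances from the mean, added up and divided by n - 1.
variance_manual <- sum((x - mean(x))^2) / (n - 1)

# The standard deviation is the positive square root of the variance.
sd_manual <- sqrt(variance_manual)

# Both comparisons should report TRUE.
all.equal(variance_manual, var(x))
all.equal(sd_manual, sd(x))
```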


Standard deviation and variance only operate on one dimension, so you can only calculate the standard deviation for each dimension of the data set independently of the other dimensions. However, it is useful to have a similar measure to find out how much the dimensions vary from the mean with respect to each other. Covariance is always measured between 2 dimensions. If you calculate the covariance between one dimension and itself, you get the variance. If you had a 3-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the x and z dimensions, and the y and z dimensions. The English definition of covariance is “For each data item, multiply the difference between the x value and the mean of x by the difference between the y value and the mean of y. Add all these up, and divide by n-1”.
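
The English definition of covariance translates almost word for word into R. This is a small sketch using the built-in women data set as example data; the variable names are only illustrative.

```r
x <- women$height
y <- women$weight
n <- nrow(women)

# Multiply the deviations from the two means, add them up, divide by n - 1.
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Should report TRUE: the manual value agrees with cov().
all.equal(cov_manual, cov(x, y))

# Should report TRUE: the covariance of a dimension with itself is its variance.
all.equal(cov(x, x), var(x))
```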

In R:

head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
cov(women)
       height   weight
height     20  69.0000
weight     69 240.2095
cor(women)
          height    weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000

# Build a random data set over the same ranges; no seed is set, so the values differ between runs.
a <- sample(min(women$height):max(women$height), nrow(women))
b <- sample(min(women$weight):max(women$weight), nrow(women))
blah <- data.frame(height = a, weight = b)
head(blah)
  height weight
1     65    120
2     71    119
3     63    137
4     70    125
5     68    163
6     62    121
cov(blah)
          height    weight
height  20.00000 -12.21429
weight -12.21429 254.60000
cor(blah)
           height     weight
height  1.0000000 -0.1711685
weight -0.1711685  1.0000000

The covariance values indicate whether both dimensions increase together (positive value, as seen in the women dataset), have no linear relationship (zero value), or move in opposite directions, so that as one increases the other decreases (negative value, as in the random dataset I created).

Eigenvectors and eigenvalues

We can decompose the covariance matrix of the data into eigenvectors and eigenvalues, which always exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction, and its eigenvalue is a number that indicates how much variance there is in the data along that direction, i.e. how spread out the data is in that direction. The eigenvector with the highest eigenvalue is the first principal component. The number of eigenvector/eigenvalue pairs in a data set is equal to the number of dimensions of the data. The reason for this is that the eigenvectors put the data into a new set of dimensions, and the number of new dimensions has to equal the number of original dimensions.


An eigenvalue is a number telling you how much variance there is in the data in that direction. In the example above, the eigenvalue is a number telling us how spread out the data is along the line.
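
To make the eigenvector/eigenvalue pairs concrete, here is a minimal sketch that decomposes the covariance matrix of R's built-in women data set with eigen() and compares the result with prcomp(). The object names S, e, v1 and women_pca are only illustrative.

```r
# Covariance matrix of the two variables in the built-in women data set.
S <- cov(women)

# eigen() returns the eigenvalues in decreasing order, with the matching
# eigenvectors as the columns of e$vectors.
e <- eigen(S)
e$values
e$vectors

# Each pair satisfies S %*% v = lambda * v (should report TRUE).
v1 <- e$vectors[, 1]
all.equal(as.vector(S %*% v1), e$values[1] * v1)

# The eigenvalues are the variances along the principal components, so they
# match the squared standard deviations from prcomp() (should report TRUE).
women_pca <- prcomp(women)
all.equal(e$values, women_pca$sdev^2)
```

There are two eigenvalues because the data set has two dimensions (height and weight).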

Dimension reduction

When the number of variables is larger than the number of samples, PCA can reduce the dimensionality of the samples to, at most, the number of samples, without loss of information.[3]
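
A small sketch of this situation, using random numbers as a stand-in for real measurements (5 samples and 20 variables are arbitrary choices): prcomp() returns no more components than there are samples, and the original data can be rebuilt from those components.

```r
set.seed(42)
n_samples <- 5
n_vars <- 20
X <- matrix(rnorm(n_samples * n_vars), nrow = n_samples)

# By default prcomp() centres the columns but does not scale them.
wide_pca <- prcomp(X)

# Scores: one row per sample, one column per retained component.
dim(wide_pca$x)

# Rebuild the data: rotate the scores back and add the column means again.
X_rebuilt <- wide_pca$x %*% t(wide_pca$rotation)
X_rebuilt <- sweep(X_rebuilt, 2, wide_pca$center, "+")

# The largest reconstruction error is zero up to floating-point rounding.
max_error <- max(abs(X_rebuilt - X))
print(max_error)
```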

R packages


Using base R.

iris.pca <- prcomp(x = iris[,-5], scale. = TRUE)

names(iris.pca)
[1] "sdev"     "rotation" "center"   "scale"    "x"

iris.pca
Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation (n x k) = (4 x 4):
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

summary(iris.pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

The rotation element contains the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). The function princomp returns this matrix in the element loadings.
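
The elements returned by prcomp() can be related to each other directly. The following is a sketch rather than a required workflow; pc_variance, prop_variance and scores_manual are illustrative names.

```r
iris.pca <- prcomp(x = iris[, -5], scale. = TRUE)

# The variance along each component is the squared standard deviation.
pc_variance <- iris.pca$sdev^2

# Proportion and cumulative proportion of variance, as reported by summary().
prop_variance <- pc_variance / sum(pc_variance)
round(prop_variance, 4)
round(cumsum(prop_variance), 4)

# The loadings are orthonormal, so their cross-product is the identity matrix
# (should report TRUE).
all.equal(crossprod(iris.pca$rotation), diag(4), check.attributes = FALSE)

# The scores in x are the centred and scaled data multiplied by the loadings
# (should report TRUE).
scores_manual <- scale(iris[, -5]) %*% iris.pca$rotation
all.equal(scores_manual, iris.pca$x, check.attributes = FALSE)
```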

Using ade4.

dudi.pca() is the main function that implements PCA in ade4. By default it is interactive: it prompts the user for the number of dimensions to retain. Setting scannf = FALSE and passing the number of dimensions as nf skips the prompt.


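# Assumes `cars` is a data frame with at least 19 columns
# (not the built-in datasets::cars, which has only two).
library(ade4)

# Run a PCA using the 10 non-binary numeric variables.
cars_pca <- dudi.pca(cars[,9:19], scannf = FALSE, nf = 4)

# Explore the summary of cars_pca.
summary(cars_pca)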

Further reading


  1. Principal Component Analysis on Wikipedia
  2. Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction
  3. What is principal component analysis
  4. Implementing a Principal Component Analysis (PCA) in Python step by step
  5. Step by step principal components analysis using R
  6. Explaining PCA to a school child
  7. Singular Value Decomposition Tutorial
  8. What is principal component analysis?
  9. Principal Component Analysis Explained Visually
  10. Singular value decomposition and principal component analysis
  11. Principal Component Analysis in R
  12. Does each eigenvalue in PCA correspond to one particular original variable?
  13. A tutorial on Principal Components Analysis