Principal Component Analysis

From Dave's wiki

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.[1]

PCA finds the principal components, which capture the underlying structure in the data: they are the directions of greatest variance, i.e. the directions along which the data is most spread out.[2]

PCA reduces dimensionality by identifying directions, called principal components, along which the variation in the data is maximal. These principal components are new variables that are linear combinations of the original variables.[3]

Correlation indicates that there is redundancy in the data; we can therefore simplify the data by replacing a group of correlated variables with a single new variable. PCA creates a new set of variables, called principal components, which are linear combinations of the original variables.

[4] [5] [6] [7] [8] [9] [10] [11] [12]

Intuition

Imagine conducting an experiment and measuring various variables that result from it, in order to understand the dynamics of the system.

https://davetang.org/file/PCA-Tutorial-Intuition_jp.pdf

Why might it be possible to reduce dimensions?

When there are correlations (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect correlation. The data points will fall close to a straight line, and that line could be used as a new (one-dimensional) axis to represent the variation among data points. As another example, suppose that we have verbal, math, and total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the data because total = verbal + math, meaning the third variable is completely determined by the first two. The reason for saying "at most" two dimensions is that if there is a strong correlation between verbal and math, then there may be only one true dimension to the data. See https://onlinecourses.science.psu.edu/stat505/node/52.
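
A minimal sketch of this idea in R, using simulated data (the variable names and numbers are made up for illustration): two nearly perfectly correlated variables fall close to a line, so a single principal component captures almost all of the variation.

# Two nearly perfectly correlated variables: y is almost completely
# determined by x, so the data falls close to a line.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.1)
pca <- prcomp(cbind(x, y), scale. = TRUE)
summary(pca)
# PC1 typically explains ~99% of the variance, i.e. the two-dimensional
# data is effectively one-dimensional.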

Background mathematics

Background mathematics necessary for understanding PCA.[13]

  • Standard deviation
  • Covariance
  • Eigenvectors
  • Eigenvalues

Standard deviation and variance

The standard deviation (SD) of a data set is a measure of how spread out the data is. The English definition of the SD is: “The average distance from the mean of the data set to a point”. The way to calculate it is to compute the squared distance from each data point to the mean of the set, add these all up, divide by n-1, and take the positive square root. Variance is another measure of the spread of data in a data set and is simply the SD squared.
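
A short sketch of this recipe in R (the toy numbers are arbitrary), checked against the built-in sd() and var() functions:

# Sum the squared distances from the mean, divide by n-1 (variance),
# then take the positive square root (standard deviation).
x <- c(4, 8, 15, 16, 23, 42)
n <- length(x)
variance <- sum((x - mean(x))^2) / (n - 1)
sqrt(variance)   # standard deviation calculated by hand
sd(x)            # built-in; matches sqrt(variance)
var(x)           # built-in; matches variance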

Covariance

Standard deviation and variance only operate on one dimension, so you can only calculate the standard deviation for each dimension of the data set independently of the other dimensions. However, it is useful to have a similar measure that tells us how much the dimensions vary from the mean with respect to each other. Covariance is always measured between two dimensions. If you calculate the covariance between one dimension and itself, you get the variance. If you had a three-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the x and z dimensions, and the y and z dimensions. The English definition of covariance is: “For each data item, multiply the difference between the x value and the mean of x by the difference between the y value and the mean of y. Add all these up, and divide by n-1”.

In R:

head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
cov(women)
       height   weight
height     20  69.0000
weight     69 240.2095
cor(women)
          height    weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000

# Sample heights and weights independently so the two variables are unrelated.
set.seed(31)
a <- sample(min(women$height):max(women$height), nrow(women))
b <- sample(min(women$weight):max(women$weight), nrow(women))
blah <- data.frame(height = a, weight = b)
head(blah)
  height weight
1     65    120
2     71    119
3     63    137
4     70    125
5     68    163
6     62    121
cov(blah)
          height    weight
height  20.00000 -12.21429
weight -12.21429 254.60000
cor(blah)
           height     weight
height  1.0000000 -0.1711685
weight -0.1711685  1.0000000

The covariance values indicate whether both dimensions increase together (the positive value seen in the women dataset), have no linear relationship (a value of zero), or whether one decreases as the other increases (the negative value in the random dataset created above).
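
The English definition above can be checked directly against cov() using the women dataset already loaded:

# Covariance by hand: for each data point, multiply the deviation of
# height from its mean by the deviation of weight from its mean,
# sum these up, and divide by n-1. Matches cov(women)[1, 2] (69).
n <- nrow(women)
sum((women$height - mean(women$height)) *
    (women$weight - mean(women$weight))) / (n - 1)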

Eigenvectors and eigenvalues

We can decompose the covariance matrix of the data into eigenvectors and eigenvalues, which always exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction, and an eigenvalue is a number that indicates how much variance the data has in that direction, i.e. the eigenvalue indicates the spread of the data in that direction. The eigenvector with the highest eigenvalue is the first principal component. The number of eigenvector/eigenvalue pairs in a data set is equal to the number of dimensions of the data. The reason for this is that the eigenvectors put the data into a new set of dimensions, and the number of new dimensions has to equal the original number of dimensions.

[Figure: eigenvalues_and_eigenvectors.png]

An eigenvalue is a number telling you how much variance there is in the data in that direction; in the example above, the eigenvalue tells us how spread out the data is along the line.

http://setosa.io/ev/eigenvectors-and-eigenvalues/
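
In R, eigen() returns the eigenvalue/eigenvector pairs of a matrix. A small sketch using the covariance matrix of the women dataset from above:

# Eigendecomposition of a covariance matrix: one eigenvalue/eigenvector
# pair per dimension, with the eigenvalues giving the variance along
# each eigenvector (largest first).
e <- eigen(cov(women))
e$values               # variance along each eigenvector
e$vectors              # one eigenvector (direction) per column
sum(e$values)          # equals the total variance ...
sum(diag(cov(women)))  # ... i.e. the sum of the individual variances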

Dimension reduction

When the number of variables is larger than the number of samples, PCA can reduce the dimensionality of the samples to, at most, the number of samples, without loss of information.[3]
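
A sketch with simulated data (dimensions chosen arbitrarily): 10 samples measured on 100 variables. PCA returns at most as many components as samples (in fact n-1 with non-zero variance after centring):

# More variables (100) than samples (10): the samples can be represented
# in far fewer dimensions without losing information.
set.seed(2)
m <- matrix(rnorm(10 * 100), nrow = 10, ncol = 100)
pca <- prcomp(m)
length(pca$sdev)       # number of components returned (10)
sum(pca$sdev > 1e-10)  # components with non-zero variance (9 = n - 1)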

Variance explained

https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained

In PCA, "variance" means the summative variance, also called the multivariate variability, overall variability, or total variability. As an example, consider the covariance matrix of three variables:

  1.343730519   -.160152268    .186470243 
  -.160152268    .619205620   -.126684273 
   .186470243   -.126684273   1.485549631

Their variances are on the diagonal and the sum of the three values (3.448) is the overall variability.

PCA replaces the original variables with new variables, called principal components (PCs), which are orthogonal (i.e. they have zero covariance with each other) and have variances (called eigenvalues) in decreasing order. The covariance matrix of the principal components extracted from the above data is:

  1.651354285    .000000000    .000000000 
   .000000000   1.220288343    .000000000 
   .000000000    .000000000    .576843142

Note that the diagonal sum is still 3.448, which says that all three components account for all the multivariate variability. The first PC accounts for or explains 1.651/3.448 = 47.9% of the overall variability; the second PC explains 1.220/3.448 = 35.4% of it; the third PC explains .577/3.448 = 16.7% of it.

What do they mean when they say that "PCA maximises variance" or "PCA explains maximal variance"? It does not mean that PCA finds the largest variance among the three diagonal values. Rather, PCA finds, in the data space, the direction with the largest variance out of the overall variance; that largest variance is 1.651. It then finds the direction with the second largest variance, orthogonal to the first one, out of the remaining 3.448 - 1.651 of the overall variance, and so on.

Mathematically, PCA is performed via linear algebra routines for eigendecomposition (of the covariance matrix) or singular value decomposition (SVD, of the data matrix). These return all the eigenvalues (1.651, 1.220, and 0.577) and the corresponding eigenvectors.
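
The worked example above can be reproduced in R by reconstructing the covariance matrix and calling eigen():

# Reconstruct the 3 x 3 covariance matrix from the example above.
S <- matrix(c( 1.343730519, -0.160152268,  0.186470243,
              -0.160152268,  0.619205620, -0.126684273,
               0.186470243, -0.126684273,  1.485549631),
            nrow = 3, byrow = TRUE)
e <- eigen(S)
e$values                  # 1.651, 1.220, 0.577: the PC variances (eigenvalues)
sum(e$values)             # 3.448: the overall variability
e$values / sum(e$values)  # proportion of variance explained: 47.9%, 35.4%, 16.7%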

R packages

Example

Using base R.

iris.pca <- prcomp(x = iris[,-5], scale. = TRUE)

names(iris.pca)
[1] "sdev"     "rotation" "center"   "scale"    "x"

iris.pca
Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation (n x k) = (4 x 4):
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

summary(iris.pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

The rotation element contains the matrix of variable loadings (i.e. a matrix whose columns contain the eigenvectors). The function princomp returns this in the element loadings.
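
The eigenvalues themselves are not stored directly by prcomp but can be recovered by squaring sdev; dividing by their sum reproduces the "Proportion of Variance" row of summary():

iris.pca$sdev^2                         # eigenvalues (PC variances)
iris.pca$sdev^2 / sum(iris.pca$sdev^2)  # 0.7296 0.2285 0.0367 0.0052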

Using ade4.

dudi.pca() is the main function that implements PCA in ade4. By default it is interactive: it asks the user to enter the number of dimensions to retain. Setting scannf = FALSE suppresses the prompt, and nf sets the number of retained dimensions directly.

install.packages("ade4")
library(ade4)

# Run a PCA using the non-binary numeric variables in columns 9 to 19.
# Note: this assumes a cars data frame that contains these columns
# (e.g. from a course dataset); it is not base R's built-in cars dataset.
cars_pca <- dudi.pca(cars[,9:19], scannf = FALSE, nf = 4)

# Explore the summary of cars_pca.
summary(cars_pca)

Further reading

References

  1. Principal Component Analysis on Wikipedia http://en.wikipedia.org/wiki/Principal_component_analysis
  2. Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
  3. What is principal component analysis? http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-303.html
  4. Implementing a Principal Component Analysis (PCA) in Python step by step http://sebastianraschka.com/Articles/2014_pca_step_by_step.html
  5. Step by step principal components analysis using R http://davetang.org/muse/2012/02/01/step-by-step-principal-components-analysis-using-r/
  6. Explaining PCA to a school child http://davetang.org/muse/2012/12/17/explaining-pca-to-a-school-child/
  7. Singular Value Decomposition Tutorial http://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf
  8. What is principal component analysis? https://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/
  9. Principal Component Analysis Explained Visually http://setosa.io/ev/principal-component-analysis/
  10. Singular value decomposition and principal component analysis http://public.lanl.gov/mewall/kluwer2002.html
  11. Principal Component Analysis in R https://poissonisfish.wordpress.com/2017/01/23/principal-component-analysis-in-r/
  12. Does each eigenvalue in PCA correspond to one particular original variable? https://stats.stackexchange.com/questions/161799/does-each-eigenvalue-in-pca-correspond-to-one-particular-original-variable
  13. A tutorial on Principal Components Analysis http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf