Principal Component Analysis
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.[1]
PCA finds the principal components, the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out.[2]
PCA reduces the dimensionality of a data set by identifying directions, called principal components, along which the variation in the data is maximal. These new variables, the principal components, are linear combinations of the original variables.[3]
Correlation indicates that there is redundancy in the data; we can therefore simplify the data by replacing a group of correlated variables with a single new variable. PCA creates a new set of variables, called principal components, that are linear combinations of the original variables.[4][5][6][7][8][9][10][11][12]
Intuition
Imagine conducting an experiment and measuring the various variables that result from it, in order to understand the dynamics of the system.
https://davetang.org/file/PCA-Tutorial-Intuition_jp.pdf
Why might it be possible to reduce dimensions?
When we have correlations (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect correlation. The data points will fall close to a straight line, and that line could be used as a new (one-dimensional) axis to represent the variation among the data points. As another example, suppose that we have verbal, math, and total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the data because total = verbal + math, meaning the third variable is completely determined by the first two. The reason for saying "at most" two dimensions is that if there is a strong correlation between verbal and math, then it may be possible that there is only one true dimension to the data. See https://onlinecourses.science.psu.edu/stat505/node/52.
Background mathematics
Background mathematics necessary for understanding PCA.[13]
- Standard deviation
- Covariance
- Eigenvectors
- Eigenvalues
Standard deviation and variance
The Standard Deviation (SD) of a data set is a measure of how spread out the data is. A plain-English definition of the SD is: “the average distance from the mean of the data set to a point”. To calculate it, compute the squared distance from each data point to the mean of the set, add them all up, divide by n-1, and take the positive square root. Variance is another measure of the spread of data in a data set and is simply the SD squared.
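The calculation described above can be written out directly; a minimal sketch in R, using a made-up data vector, checking it against the built-in var() and sd() functions:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # example data

# sum of squared distances from the mean, divided by n-1
my_var <- sum((x - mean(x))^2) / (length(x) - 1)

# the SD is the positive square root of the variance
my_sd <- sqrt(my_var)

# these match R's built-in functions
all.equal(my_var, var(x))
all.equal(my_sd, sd(x))
```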
Covariance
Standard deviation and variance operate on only one dimension, so you can only calculate the standard deviation for each dimension of the data set independently of the other dimensions. However, it is useful to have a similar measure of how much the dimensions vary from the mean with respect to each other. Covariance is always measured between two dimensions. If you calculate the covariance between one dimension and itself, you get the variance. If you had a three-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the x and z dimensions, and the y and z dimensions. The English definition of covariance is: “for each data item, multiply the difference between the x value and the mean of x by the difference between the y value and the mean of y; add all these up, and divide by n-1”.
In R:
```r
head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129

cov(women)
       height   weight
height     20  69.0000
weight     69 240.2095

cor(women)
          height    weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000

set.seed(31)
a <- sample(min(women$height):max(women$height), nrow(women))
b <- sample(min(women$weight):max(women$weight), nrow(women))
blah <- data.frame(height = a, weight = b)

head(blah)
  height weight
1     65    120
2     71    119
3     63    137
4     70    125
5     68    163
6     62    121

cov(blah)
          height    weight
height  20.00000 -12.21429
weight -12.21429 254.60000

cor(blah)
           height     weight
height  1.0000000 -0.1711685
weight -0.1711685  1.0000000
```
The covariance values indicate whether both dimensions increase together (positive value seen in the women dataset), are independent of each other (zero value), or as one increases, the other decreases (negative value in the random dataset I created).
Eigenvectors and eigenvalues
We can deconstruct data into eigenvectors and eigenvalues, which always exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction, and its eigenvalue is a number indicating how much variance there is in the data in that direction, i.e. how spread out the data is along that line. The eigenvector with the highest eigenvalue is the first principal component. The number of eigenvector/eigenvalue pairs in a data set is equal to the number of dimensions of the data. The reason for this is that the eigenvectors put the data into a new set of dimensions, and these new dimensions have to equal the original number of dimensions.
http://setosa.io/ev/eigenvectors-and-eigenvalues/
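This pairing can be seen with R's eigen() function; a small sketch using made-up two-dimensional data that roughly falls on a line:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)  # y is strongly correlated with x
d <- cbind(x, y)

# eigen-decompose the covariance matrix of the data
e <- eigen(cov(d))

e$values   # one eigenvalue per dimension, in decreasing order
e$vectors  # each column is an eigenvector (a direction)

# the eigenvector paired with the largest eigenvalue points along
# the line the data falls on: it is the first principal component
e$vectors[, 1]
```

Note that the eigenvalues sum to the total variance of the data (the sum of the diagonal of the covariance matrix), a fact used in the "Variance explained" section below.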
Dimension reduction
When the number of variables is larger than the number of samples, PCA can reduce the dimensionality of the samples to, at most, the number of samples, without loss of information.[3]
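This cap can be checked with prcomp(); a sketch with made-up data where the variables outnumber the samples (5 samples, 20 variables). Note that after mean-centering, only n-1 components carry any variance:

```r
set.seed(42)
m <- matrix(rnorm(5 * 20), nrow = 5, ncol = 20)  # 5 samples, 20 variables

pca <- prcomp(m)

# the number of components is capped by the number of samples, not variables
length(pca$sdev)

# the last component has essentially zero variance because centering
# the 5 samples leaves a matrix of rank 4
pca$sdev
```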
Variance explained
https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained
In a PCA, variance can mean the summative variance, multivariate variability, overall variability, or total variability. As an example, consider the covariance matrix of three variables.
```
 1.343730519  -0.160152268   0.186470243
-0.160152268   0.619205620  -0.126684273
 0.186470243  -0.126684273   1.485549631
```
Their variances are on the diagonal and the sum of the three values (3.448) is the overall variability.
PCA replaces the original variables with new variables, called principal components (PCs), which are orthogonal (i.e. they have zero covariance) and have variances (called eigenvalues) in decreasing order. The covariance matrix of the principal components extracted from the above data is:
```
1.651354285  0.000000000  0.000000000
0.000000000  1.220288343  0.000000000
0.000000000  0.000000000  0.576843142
```
Note that the diagonal sum is still 3.448, which says that all three components account for all the multivariate variability. The first PC accounts for or explains 1.651/3.448 = 47.9% of the overall variability; the second PC explains 1.220/3.448 = 35.4% of it; the third PC explains .577/3.448 = 16.7% of it.
What does it mean when we say that "PCA maximises variance" or "PCA explains maximal variance"? It does not mean that PCA finds the largest of the three variances above. Rather, PCA finds, in the data space, the dimension (direction) with the largest variance out of the overall variance; here, that largest variance is 1.651. It then finds the dimension with the second largest variance, orthogonal to the first one, out of the remaining 3.448 - 1.651 of overall variance.
Mathematically, PCA is performed via linear algebra functions called eigen-decomposition or SVD-decomposition. These functions return all the eigenvalues (1.651, 1.220, and 0.577) and the corresponding eigenvectors.
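The numbers above can be reproduced by running eigen() on the covariance matrix from this example:

```r
# the covariance matrix of the three variables from the example above
cov_mat <- matrix(c( 1.343730519, -0.160152268,  0.186470243,
                    -0.160152268,  0.619205620, -0.126684273,
                     0.186470243, -0.126684273,  1.485549631),
                  nrow = 3, byrow = TRUE)

e <- eigen(cov_mat)

round(e$values, 3)                  # 1.651 1.220 0.577
sum(e$values)                       # 3.448, the overall variability
round(e$values / sum(e$values), 3)  # proportion of variance explained by each PC
```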
R packages
- FactoMineR
- ade4
- stats
- ca
- MASS
- ExPosition
Example
Using base R.
```r
iris.pca <- prcomp(x = iris[,-5], scale. = TRUE)

names(iris.pca)
[1] "sdev"     "rotation" "center"   "scale"    "x"

iris.pca
Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation (n x k) = (4 x 4):
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

summary(iris.pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
```
The rotation slot contains the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). The function princomp returns this in the element loadings.
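To connect the rotation matrix back to eigenvectors, a quick check (re-running the same PCA): since scale. = TRUE, prcomp() works on the correlation matrix, so the squared standard deviations are its eigenvalues, and each eigenvector (column of rotation) has unit length:

```r
iris.pca <- prcomp(x = iris[,-5], scale. = TRUE)

# eigen-decompose the correlation matrix of the same four variables
ev <- eigen(cor(iris[,-5]))

# the PC variances are the eigenvalues of the correlation matrix
all.equal(iris.pca$sdev^2, ev$values)

# each column of the rotation matrix is a unit-length eigenvector
colSums(iris.pca$rotation^2)
```

(Signs of individual eigenvectors may differ between the two functions, which is why only the eigenvalues and column lengths are compared.)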
Using ade4.
dudi.pca() is the main function implementing PCA in ade4. By default it is interactive: it asks the user to select the number of dimensions to retain.
```r
install.packages("ade4")
library(ade4)

# Run a PCA using the 10 non-binary numeric variables.
cars_pca <- dudi.pca(cars[,9:19], scannf = FALSE, nf = 4)

# Explore the summary of cars_pca.
summary(cars_pca)
```
Further reading
- Understanding PCA using Stack Overflow data https://juliasilge.com/blog/stack-overflow-pca/
- https://stats.stackexchange.com/questions/133149/why-is-variance-instead-of-standard-deviation-the-default-measure-of-informati
- http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/
- http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
- https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579
- https://stats.stackexchange.com/questions/62677/pca-on-correlation-or-covariance-does-pca-on-correlation-ever-make-sense
- https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance
- https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained
- https://stats.stackexchange.com/questions/115032/how-to-find-which-variables-are-most-correlated-with-the-first-principal-compone
- https://onlinecourses.science.psu.edu/stat505/node/49
- https://stats.stackexchange.com/questions/92499/how-to-interpret-pca-loadings/92512#92512
- https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another/143949#143949
References
- ↑ Principal Component Analysis on Wikipedia http://en.wikipedia.org/wiki/Principal_component_analysis
- ↑ Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
- ↑ 3.0 3.1 What is principal component analysis http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-303.html
- ↑ Implementing a Principal Component Analysis (PCA) in Python step by step http://sebastianraschka.com/Articles/2014_pca_step_by_step.html
- ↑ Step by step principal components analysis using R http://davetang.org/muse/2012/02/01/step-by-step-principal-components-analysis-using-r/
- ↑ Explaining PCA to a school child http://davetang.org/muse/2012/12/17/explaining-pca-to-a-school-child/
- ↑ Singular Value Decomposition Tutorial http://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf
- ↑ What is principal component analysis? https://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/
- ↑ Principal Component Analysis Explained Visually http://setosa.io/ev/principal-component-analysis/
- ↑ Singular value decomposition and principal component analysis http://public.lanl.gov/mewall/kluwer2002.html
- ↑ Principal Component Analysis in R https://poissonisfish.wordpress.com/2017/01/23/principal-component-analysis-in-r/
- ↑ Does each eigenvalue in PCA correspond to one particular original variable? https://stats.stackexchange.com/questions/161799/does-each-eigenvalue-in-pca-correspond-to-one-particular-original-variable
- ↑ A tutorial on Principal Components Analysis http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf