Principal Component Analysis
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
The principal components capture the underlying structure in the data: they are the directions along which the data are most spread out, i.e. the directions of greatest variance.
Each principal component is a new variable formed as a linear combination of the original variables. PCA is commonly used to reduce the number of dimensions, because the first few components often account for most of the variation in the data.
Correlation indicates redundancy in the data, so a group of correlated variables can be replaced with a single new variable.
Imagine conducting an experiment and measuring various variables that result from it, to help understand the dynamics of the system.
Why might it be possible to reduce dimensions?
When the x-variables are correlated (multicollinearity), the data may fall more or less on a line or plane with fewer dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect correlation. The data points will fall close to a straight line, and that line could be used as a new (one-dimensional) axis to represent the variation among data points. As another example, suppose that we have verbal, math, and total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the data because total = verbal + math, meaning the third variable is completely determined by the first two. The reason for saying "at most" two dimensions is that if verbal and math are strongly correlated, there may be only one true dimension to the data. See https://onlinecourses.science.psu.edu/stat505/node/52.
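The SAT example can be checked numerically. The sketch below uses simulated scores; the sample size, means, and noise levels are made-up assumptions for illustration, not real SAT data. Because total is an exact sum of the other two columns, the third principal component should carry essentially no variance.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100     # hypothetical number of students

# Simulated (not real) scores; math is made to correlate with verbal.
verbal <- rnorm(n, mean = 500, sd = 100)
math   <- 500 + 0.8 * (verbal - 500) + rnorm(n, sd = 60)
sat    <- data.frame(verbal = verbal, math = math, total = verbal + math)

# Centred, unscaled PCA on the three variables.
sat.pca <- prcomp(sat)

# The third standard deviation is zero up to floating-point error,
# because total is completely determined by verbal and math.
sat.pca$sdev

# Share of the total variance carried by each component.
summary(sat.pca)$importance["Proportion of Variance", ]
```

Since verbal and math are also correlated in this simulation, the first component alone should account for most of the variance, which is the "at most two dimensions" point made above.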
Background mathematics necessary for understanding PCA.
- Standard deviation
Standard deviation and variance
The Standard Deviation (SD) of a data set is a measure of how spread out the data is. The English definition of the SD is: “The average distance from the mean of the data set to a point”. The way to calculate it is to compute the squares of the distance from each data point to the mean of the set, add them all up, divide by n-1, and take the positive square root. Variance is another measure of the spread of data in a data set and is simply the SD squared.
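As a minimal sketch, the calculation described above can be written out step by step for the height column of R's built-in women dataset and compared with the built-in var() and sd() functions. The variable names are chosen here for illustration.

```r
x <- women$height   # 15 heights from the built-in women dataset
n <- length(x)

# Squared distance from each data point to the mean.
sq_dist <- (x - mean(x))^2

# Sum the squared distances, divide by n - 1, then take the square root.
variance <- sum(sq_dist) / (n - 1)
std_dev  <- sqrt(variance)

# Both comparisons should return TRUE.
all.equal(variance, var(x))
all.equal(std_dev, sd(x))
```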
Standard deviation and variance only operate on one dimension, so you can only calculate the standard deviation for each dimension of the data set independently of the other dimensions. However, it is useful to have a similar measure of how much the dimensions vary from the mean with respect to each other. Covariance is always measured between two dimensions. If you calculate the covariance between one dimension and itself, you get the variance. If you had a 3-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the x and z dimensions, and the y and z dimensions. The English definition of covariance is “For each data item, multiply the difference between the x value and the mean of x by the difference between the y value and the mean of y. Add all these up, and divide by n-1”.
    head(women)
    #>   height weight
    #> 1     58    115
    #> 2     59    117
    #> 3     60    120
    #> 4     61    123
    #> 5     62    126
    #> 6     63    129

    cov(women)
    #>        height   weight
    #> height     20  69.0000
    #> weight     69 240.2095

    cor(women)
    #>           height    weight
    #> height 1.0000000 0.9954948
    #> weight 0.9954948 1.0000000

    # Create a random dataset spanning the same ranges of height and weight.
    set.seed(31)
    a <- sample(min(women$height):max(women$height), nrow(women))
    b <- sample(min(women$weight):max(women$weight), nrow(women))
    blah <- data.frame(height = a, weight = b)

    head(blah)
    #>   height weight
    #> 1     65    120
    #> 2     71    119
    #> 3     63    137
    #> 4     70    125
    #> 5     68    163
    #> 6     62    121

    cov(blah)
    #>           height    weight
    #> height  20.00000 -12.21429
    #> weight -12.21429 254.60000

    cor(blah)
    #>            height     weight
    #> height  1.0000000 -0.1711685
    #> weight -0.1711685  1.0000000
The sign of the covariance indicates whether both dimensions tend to increase together (positive value, as in the women dataset), have no linear relationship (value near zero), or move in opposite directions, with one decreasing as the other increases (negative value, as in the random dataset I created).
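The English definition of covariance translates directly into code. The following sketch applies it to the women dataset and compares the result with cov(); it also shows that dividing the covariance by the two standard deviations gives the correlation. The variable names are illustrative.

```r
x <- women$height
y <- women$weight
n <- nrow(women)

# For each data item, multiply the deviation of x from its mean by the
# deviation of y from its mean; add these up and divide by n - 1.
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation is the covariance rescaled by both standard deviations,
# which puts it on a unit-free scale between -1 and 1.
cor_xy <- cov_xy / (sd(x) * sd(y))

# Both comparisons should return TRUE.
all.equal(cov_xy, cov(x, y))
all.equal(cor_xy, cor(x, y))
```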
Eigenvectors and eigenvalues
The covariance (or correlation) matrix of a data set can be decomposed into eigenvectors and eigenvalues, which always come in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction, and its eigenvalue is a number indicating how much variance the data have along that direction, i.e. how spread out the data are along it. The eigenvector with the highest eigenvalue is the first principal component. The number of eigenvector/eigenvalue pairs equals the number of dimensions of the data, because the eigenvectors define a new set of axes for the data and there are as many new axes as original ones.
For example, if the data points fall close to a straight line, the eigenvector pointing along that line has a large eigenvalue, which tells us how spread out the data are along the line, and the eigenvector perpendicular to it has a small one.
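To make this concrete, the sketch below decomposes the covariance matrix of the two-dimensional women dataset with eigen() and checks the result against prcomp(). It assumes an unscaled PCA, so that the covariance matrix (rather than the correlation matrix) is the relevant one.

```r
S <- cov(women)   # 2 x 2 covariance matrix
e <- eigen(S)     # eigenvalues are returned in decreasing order

e$values          # one eigenvalue per dimension
e$vectors         # columns are the eigenvectors (directions)

# The eigenvalues add up to the total variance (the sum of the diagonal of S).
all.equal(sum(e$values), sum(diag(S)))

# Fraction of the total variance along each eigenvector.
e$values / sum(e$values)

# The squared standard deviations reported by an unscaled PCA
# are the same eigenvalues.
women.pca <- prcomp(women)
all.equal(women.pca$sdev^2, e$values)
```

Because height and weight are almost perfectly correlated in this dataset, nearly all of the variance lies along the first eigenvector.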
When the number of variables is larger than the number of samples, PCA can reduce the dimensionality of the samples to, at most, the number of samples, without loss of information.
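A small simulation illustrates this. The matrix below is random noise with made-up dimensions (5 samples, 20 variables), chosen only for illustration. prcomp() returns no more components than there are samples, and the original data can be rebuilt from them.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 5        # hypothetical number of samples
p <- 20       # hypothetical number of variables
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

pca <- prcomp(X)   # centred, unscaled PCA

dim(pca$x)         # scores: n rows, at most n columns
dim(pca$rotation)  # loadings: p rows, at most n columns

# Rebuild the data from the scores and loadings, then add the
# column means back (prcomp centres the data before rotating).
X_hat <- sweep(pca$x %*% t(pca$rotation), 2, pca$center, "+")

# Should return TRUE: nothing was lost by going from 20 to 5 columns.
all.equal(X_hat, X, check.attributes = FALSE)
```

After centring, the last of the five components has a standard deviation of zero up to floating-point error, so in this example four components already hold all of the information.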
Using base R.
    iris.pca <- prcomp(x = iris[,-5], scale. = TRUE)
    names(iris.pca)
    #> [1] "sdev"     "rotation" "center"   "scale"    "x"

    iris.pca
    #> Standard deviations (1, .., p=4):
    #> [1] 1.7083611 0.9560494 0.3830886 0.1439265
    #>
    #> Rotation (n x k) = (4 x 4):
    #>                     PC1         PC2        PC3        PC4
    #> Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
    #> Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
    #> Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
    #> Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

    summary(iris.pca)
    #> Importance of components:
    #>                           PC1    PC2     PC3     PC4
    #> Standard deviation     1.7084 0.9560 0.38309 0.14393
    #> Proportion of Variance 0.7296 0.2285 0.03669 0.00518
    #> Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
The rotation element contains the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). The function princomp returns this matrix in the element loadings.
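The link between prcomp() and the eigen decomposition can be verified directly. This sketch assumes the same scaled PCA of the iris measurements used above; with scale. = TRUE the relevant matrix is the correlation matrix. Eigenvectors are only defined up to sign, so absolute values are compared.

```r
iris.pca <- prcomp(iris[, -5], scale. = TRUE)

# Eigen decomposition of the correlation matrix of the four measurements.
e <- eigen(cor(iris[, -5]))

# Squared standard deviations of the components are the eigenvalues.
all.equal(iris.pca$sdev^2, e$values)

# Columns of the rotation matrix are the eigenvectors (up to sign).
all.equal(abs(unname(iris.pca$rotation)), abs(e$vectors))

# Proportion of variance, as reported by summary(iris.pca).
prop_var <- iris.pca$sdev^2 / sum(iris.pca$sdev^2)
round(prop_var, 4)
```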
dudi.pca() is the main function that implements PCA in the ade4 package. By default it is interactive: it prompts the user for the number of dimensions to retain. Setting scannf = FALSE and supplying nf skips the prompt.
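    # Install ade4 once, then load it.
    install.packages("ade4")
    library(ade4)

    # `cars` here must be a data frame with at least 19 columns; it is not
    # the two-column cars dataset that ships with base R.
    # Run a PCA on the non-binary numeric variables (columns 9 to 19),
    # keeping 4 axes without prompting.
    cars_pca <- dudi.pca(cars[,9:19], scannf = FALSE, nf = 4)

    # Explore the summary of cars_pca.
    summary(cars_pca)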
- Understanding PCA using Stack Overflow data https://juliasilge.com/blog/stack-overflow-pca/
- Principal Component Analysis on Wikipedia http://en.wikipedia.org/wiki/Principal_component_analysis
- Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
- What is principal component analysis http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-303.html
- Implementing a Principal Component Analysis (PCA) in Python step by step http://sebastianraschka.com/Articles/2014_pca_step_by_step.html
- Step by step principal components analysis using R http://davetang.org/muse/2012/02/01/step-by-step-principal-components-analysis-using-r/
- Explaining PCA to a school child http://davetang.org/muse/2012/12/17/explaining-pca-to-a-school-child/
- Singular Value Decomposition Tutorial http://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf
- What is principal component analysis? https://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/
- Principal Component Analysis Explained Visually http://setosa.io/ev/principal-component-analysis/
- Singular value decomposition and principal component analysis http://public.lanl.gov/mewall/kluwer2002.html
- Principal Component Analysis in R https://poissonisfish.wordpress.com/2017/01/23/principal-component-analysis-in-r/
- Does each eigenvalue in PCA correspond to one particular original variable? https://stats.stackexchange.com/questions/161799/does-each-eigenvalue-in-pca-correspond-to-one-particular-original-variable
- A tutorial on Principal Components Analysis http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf