# Principal Component Analysis

http://davetang.org/file/principal_components.pdf originally from http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Description from a friend (thanks!):

PCA basically reduces data to the principal components — the main contributions to the variation. So, you end up with (say) a 2D plot where one axis is the first principal component and the second is the second principal component. The remaining components are treated as unimportant; you can also look at just one, or at more than two, but most biology papers I've seen choose two. The problem is that you now have a scatter of dots on a 2D plane and, if you're lucky, your data classes separate; but that sometimes doesn't happen. So, there's a bit of guesswork that goes into it.

And from wikipedia:

Principal component analysis (PCA) involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

PCA was invented in 1901 by Karl Pearson. Now it is mostly used as a tool in exploratory data analysis and for making predictive models. PCA involves the calculation of the eigenvalue decomposition of a data covariance matrix or singular value decomposition of a data matrix, usually after mean centering the data for each attribute. The results of a PCA are usually discussed in terms of component scores and loadings (Shaw, 2003).
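To make the procedure above concrete, here is a minimal sketch of PCA "by hand": mean-centre the data, eigendecompose the covariance matrix, and check the result against `prcomp()`. The data are simulated, since nothing in the text is assumed here.

```r
## PCA via the covariance eigendecomposition, on simulated data
set.seed(1)
X <- matrix(rnorm(100 * 4), ncol = 4)

Xc <- scale(X, center = TRUE, scale = FALSE)   ## mean-centre each attribute
eig <- eigen(cov(Xc))                          ## eigendecomposition of the covariance matrix
scores_manual <- Xc %*% eig$vectors            ## component scores

prc <- prcomp(X)                               ## the same thing, via SVD internally

## The eigenvalues equal the squared sdev reported by prcomp
stopifnot(isTRUE(all.equal(eig$values, prc$sdev^2)))

## The scores agree up to the (arbitrary) sign of each component
stopifnot(isTRUE(all.equal(abs(scores_manual), abs(prc$x),
                           check.attributes = FALSE)))
```

The agreement only up to sign is expected: each eigenvector is defined up to a factor of -1, so different implementations may flip individual components.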

PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation can be thought of as revealing the internal structure of the data in a way which best explains the variance in the data. If a multivariate dataset is visualised as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA supplies the user with a lower-dimensional picture, a "shadow" of this object when viewed from its (in some sense) most informative viewpoint.
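The "shadow" idea can be sketched in a couple of lines: take simulated 4-dimensional data and keep only the first two columns of the score matrix, which is the 2-D picture from the most informative viewpoint.

```r
## Project 4-dimensional simulated data onto its first two principal components
set.seed(2)
X <- matrix(rnorm(30 * 4), ncol = 4)

prc <- prcomp(X)
shadow <- prc$x[, 1:2]   ## each row is now a 2-D coordinate

stopifnot(nrow(shadow) == 30, ncol(shadow) == 2)
```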

PCA is closely related to factor analysis; indeed, some statistical packages deliberately conflate the two techniques. True factor analysis makes different assumptions about the underlying structure and solves eigenvectors of a slightly different matrix.

# PCA using R

http://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_principal_components_analysis.pdf

A sample R script for performing PCA (provided by the same friend, thanks again!):

```r
#!/usr/bin/env Rscript

##################################################
## Read in the data and perform PCA
##################################################

## Read the data in using a tab character as a separator.
## The first row and column are headers
data <- read.table(file="yeast.data", sep="\t", header=TRUE, row.names=1)

## Perform PCA (uncomment the first line to scale the variables first)
#data.prc <- prcomp(data, scale=TRUE)
data.prc <- prcomp(data)

## Show the importance of the components to two decimal places
print(summary(data.prc), digits=2)

##################################################
## Extract the loadings and the scores
##################################################

## The rotation matrix holds the loadings
loadings <- data.prc$rotation

## The scores are the x component: the centred data multiplied by the loadings
scores <- data.prc$x

##################################################
## Handle the scores
##################################################

## Plot of the scores, with the axes
#png(filename="scores.png", width=480, height=480, bg="transparent", pointsize=12)
png(filename="scores.png", width=480, height=480, pointsize=12)
plot(scores[,1], scores[,2], xlab="Scores 1", ylab="Scores 2")
text(x=scores[,1], y=scores[,2], labels=row.names(scores), pos=4)
lines(c(-5,5), c(0,0), lty=2) ## Draw the horizontal axis
lines(c(0,0), c(-4,3), lty=2) ## Draw the vertical axis
dev.off() ## Close the image

## Hierarchical clustering of the first 6 PCs of the scores
scores2 <- scores[,1:6]

## Calculate the distance matrix for scores2
scores_dist <- dist(scores2)

## Calculate the hierarchical clustering of the scores
hclust_scores <- hclust(scores_dist)

## Show the hierarchical clustering
png(filename="scores-hc.png", width=480, height=480, pointsize=12)
#png(filename="scores-hc.png", width=480, height=480, bg="transparent", pointsize=12)
plot(hclust_scores)
dev.off() ## Close the image

##################################################
## Handle the loadings
##################################################

## Plot of the loadings
#png(filename="loadings.png", width=480, height=480, bg="transparent", pointsize=12)
png(filename="loadings.png", width=480, height=480, pointsize=12)
plot(loadings[,1], loadings[,2], xlab="Loadings 1", ylab="Loadings 2")
text(x=loadings[,1], y=loadings[,2], labels=row.names(loadings), pos=4)
dev.off() ## Close the image

## Hierarchical clustering of the first 6 PCs of the loadings
loadings2 <- loadings[,1:6]

## Calculate the distance matrix for loadings2
loadings_dist <- dist(loadings2)

## Calculate the hierarchical clustering of the loadings
hclust_loadings <- hclust(loadings_dist)

## Show the hierarchical clustering
#png(filename="loadings-hc.png", width=480, height=480, bg="transparent", pointsize=12)
png(filename="loadings-hc.png", width=480, height=480, pointsize=12)
plot(hclust_loadings)
dev.off() ## Close the image
```
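The script keeps the first 6 components by fiat. A common way to pick that number is the cumulative proportion of variance explained, computed from the `sdev` component of the `prcomp` result. A minimal sketch, using simulated data as a stand-in since `yeast.data` is not available here:

```r
## Choosing how many components to keep by cumulative variance explained
set.seed(1)
data <- as.data.frame(matrix(rnorm(50 * 6), ncol = 6))  ## stand-in for yeast.data
data.prc <- prcomp(data)

prop_var <- data.prc$sdev^2 / sum(data.prc$sdev^2)  ## variance explained per PC
cum_var <- cumsum(prop_var)

## e.g. keep enough components to explain 90% of the variance
n_keep <- which(cum_var >= 0.9)[1]
```

These are the same numbers shown in the "Proportion of Variance" and "Cumulative Proportion" rows of `summary(data.prc)`.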