# Principal Component Analysis

Updated 16th January 2018: rewrote the entire post.

It takes literally one line of code in R to conduct a Principal Component Analysis (PCA).

```
# PCA on the famous iris dataset
iris.prcomp <- prcomp(iris[, -5], scale. = TRUE)
```

Yet to this day, I am still trying to understand all the details of the method; my PCA wiki page links to various resources that have helped me understand it a bit better. Note that scale. = TRUE standardises each variable to zero mean and unit variance before the decomposition, which matters whenever the variables are measured on different scales. Now, let's look at the results of prcomp().

```
class(iris.prcomp)
[1] "prcomp"

# a more detailed summary via FactoMineR's PCA()
library(FactoMineR)
iris.pca <- PCA(iris[, -5], graph = FALSE)
summary(iris.pca)

Call:
PCA(X = iris[, -5], graph = FALSE)

Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4
Variance               2.918   0.914   0.147   0.021
% of var.             72.962  22.851   3.669   0.518
Cumulative % of var.  72.962  95.813  99.482 100.000

Individuals (the 10 first)
                 Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
1            |  2.319 | -2.265  1.172  0.954 |  0.480  0.168  0.043 | -0.128  0.074  0.003 |
2            |  2.202 | -2.081  0.989  0.893 | -0.674  0.331  0.094 | -0.235  0.250  0.011 |
3            |  2.389 | -2.364  1.277  0.979 | -0.342  0.085  0.020 |  0.044  0.009  0.000 |
4            |  2.378 | -2.299  1.208  0.935 | -0.597  0.260  0.063 |  0.091  0.038  0.001 |
5            |  2.476 | -2.390  1.305  0.932 |  0.647  0.305  0.068 |  0.016  0.001  0.000 |
6            |  2.555 | -2.076  0.984  0.660 |  1.489  1.617  0.340 |  0.027  0.003  0.000 |
7            |  2.468 | -2.444  1.364  0.981 |  0.048  0.002  0.000 |  0.335  0.511  0.018 |
8            |  2.246 | -2.233  1.139  0.988 |  0.223  0.036  0.010 | -0.089  0.036  0.002 |
9            |  2.592 | -2.335  1.245  0.812 | -1.115  0.907  0.185 |  0.145  0.096  0.003 |
10           |  2.249 | -2.184  1.090  0.943 | -0.469  0.160  0.043 | -0.254  0.293  0.013 |

Variables
                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
Sepal.Length |  0.890 27.151  0.792 |  0.361 14.244  0.130 | -0.276 51.778  0.076 |
Sepal.Width  | -0.460  7.255  0.212 |  0.883 85.247  0.779 |  0.094  5.972  0.009 |
Petal.Length |  0.992 33.688  0.983 |  0.023  0.060  0.001 |  0.054  2.020  0.003 |
Petal.Width  |  0.965 31.906  0.931 |  0.064  0.448  0.004 |  0.243 40.230  0.059 |

names(iris.prcomp)
[1] "sdev"     "rotation" "center"   "scale"    "x"

# eigenvalues = sdev^2
iris.prcomp$sdev^2
[1] 2.91849782 0.91403047 0.14675688 0.02071484

# use package factoextra to make nice plots
library(factoextra)

get_eig(iris.prcomp)
eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.91849782       72.9624454                    72.96245
Dim.2 0.91403047       22.8507618                    95.81321
Dim.3 0.14675688        3.6689219                    99.48213
Dim.4 0.02071484        0.5178709                   100.00000
```
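The comment above mentions nice plots, but get_eig() only returns a table; factoextra also ships ready-made ggplot2 visualisations. A minimal sketch of a scree plot and a biplot, reusing the iris.prcomp object from above:

```
library(factoextra)

# scree plot: percentage of variance explained by each PC
fviz_eig(iris.prcomp, addlabels = TRUE)

# biplot: individuals and variable loadings on the first two PCs
fviz_pca_biplot(iris.prcomp)
```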

PCA creates linear combinations of the original variables, called principal components, that are uncorrelated with each other and ordered so that each successive component captures as much of the remaining variance in the data as possible.
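As a quick sanity check of that claim, the scores stored in iris.prcomp$x should equal the centred-and-scaled data multiplied by the rotation (loadings) matrix. A minimal sketch, again reusing the iris.prcomp object from above:

```
# PC scores are linear combinations of the scaled variables:
# (centred, scaled data) %*% (loadings matrix)
scores.manual <- scale(iris[, -5]) %*% iris.prcomp$rotation

# should return TRUE (dimnames aside)
all.equal(unname(scores.manual), unname(iris.prcomp$x))
```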

### PCA on mtcars

Run a PCA on mtcars, a test dataset that is distributed with R:

```
# install a package that provides textplot() for labelled scatter plots
install.packages("wordcloud")

library(wordcloud)

# note: variables are not scaled here, so the PCA uses the covariance matrix
data.pca <- prcomp(mtcars)

# plot the first and second PCs, labelling each point with the car name
textplot(data.pca$x[, 1],
         data.pca$x[, 2],
         rownames(data.pca$x),
         cex  = 0.7,
         xlim = c(-250, 300),
         ylim = c(-150, 75))
```

I don't know much about cars, but at least the Merc 450s are close to each other. In addition, I guess most of the Japanese brands are on the left and the American brands are on the right?
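One caveat worth noting: prcomp() does not scale variables by default, and the mtcars columns are on very different scales (disp is in the hundreds while wt is in single digits), so the plot above is driven largely by displacement and horsepower. A minimal sketch of the standardised alternative:

```
# repeat the PCA on standardised variables so each column
# contributes equally regardless of its units
data.pca.scaled <- prcomp(mtcars, scale. = TRUE)

# proportion of variance explained by each component
summary(data.pca.scaled)
```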