Principal Component Analysis

Updated January 16th, 2018; rewrote the entire post

It takes literally one line of code in R to conduct a Principal Component Analysis (PCA).

# PCA on the famous iris dataset; scale. = TRUE standardizes the variables
iris.prcomp <- prcomp(iris[, -5], scale. = TRUE)

Yet to this day, I am still trying to understand all the details of the method. My PCA wiki page links to various resources that have helped me understand it a bit better. Now, let's look into the results of prcomp().

class(iris.prcomp)
[1] "prcomp"

For a more detailed summary, the same analysis can be run with the PCA() function from the FactoMineR package (which, like prcomp() above, standardizes the variables by default):

# same analysis with FactoMineR, which prints richer diagnostics
library(FactoMineR)
iris.pca <- PCA(iris[, -5], graph = FALSE)
summary(iris.pca)

Call:
PCA(X = iris[, -5], graph = FALSE) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4
Variance               2.918   0.914   0.147   0.021
% of var.             72.962  22.851   3.669   0.518
Cumulative % of var.  72.962  95.813  99.482 100.000

Individuals (the 10 first)
                 Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2  
1            |  2.319 | -2.265  1.172  0.954 |  0.480  0.168  0.043 | -0.128  0.074  0.003 |
2            |  2.202 | -2.081  0.989  0.893 | -0.674  0.331  0.094 | -0.235  0.250  0.011 |
3            |  2.389 | -2.364  1.277  0.979 | -0.342  0.085  0.020 |  0.044  0.009  0.000 |
4            |  2.378 | -2.299  1.208  0.935 | -0.597  0.260  0.063 |  0.091  0.038  0.001 |
5            |  2.476 | -2.390  1.305  0.932 |  0.647  0.305  0.068 |  0.016  0.001  0.000 |
6            |  2.555 | -2.076  0.984  0.660 |  1.489  1.617  0.340 |  0.027  0.003  0.000 |
7            |  2.468 | -2.444  1.364  0.981 |  0.048  0.002  0.000 |  0.335  0.511  0.018 |
8            |  2.246 | -2.233  1.139  0.988 |  0.223  0.036  0.010 | -0.089  0.036  0.002 |
9            |  2.592 | -2.335  1.245  0.812 | -1.115  0.907  0.185 |  0.145  0.096  0.003 |
10           |  2.249 | -2.184  1.090  0.943 | -0.469  0.160  0.043 | -0.254  0.293  0.013 |

Variables
                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2  
Sepal.Length |  0.890 27.151  0.792 |  0.361 14.244  0.130 | -0.276 51.778  0.076 |
Sepal.Width  | -0.460  7.255  0.212 |  0.883 85.247  0.779 |  0.094  5.972  0.009 |
Petal.Length |  0.992 33.688  0.983 |  0.023  0.060  0.001 |  0.054  2.020  0.003 |
Petal.Width  |  0.965 31.906  0.931 |  0.064  0.448  0.004 |  0.243 40.230  0.059 |
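
In these tables, Dist is an individual's distance from the origin of the component space, ctr is the contribution (in percent) of an individual or variable to a component, and cos2 is the squared cosine, which measures how well a point or variable is represented on that component.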

# the pieces of a prcomp object: sdev (component standard deviations),
# rotation (variable loadings), center and scale (values used to
# standardize the data), and x (the component scores)
names(iris.prcomp)
[1] "sdev"     "rotation" "center"   "scale"    "x"

# eigenvalues = sdev^2
iris.prcomp$sdev^2
[1] 2.91849782 0.91403047 0.14675688 0.02071484
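
Dividing the eigenvalues by their sum gives the proportion of variance explained by each component (the variance.percent column from get_eig() below, divided by 100):

# proportion of variance explained by each component
iris.prcomp$sdev^2 / sum(iris.prcomp$sdev^2)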

# the factoextra package extracts PCA results and makes nice plots
library(factoextra)

get_eig(iris.prcomp)
      eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.91849782       72.9624454                    72.96245
Dim.2 0.91403047       22.8507618                    95.81321
Dim.3 0.14675688        3.6689219                    99.48213
Dim.4 0.02071484        0.5178709                   100.00000
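
factoextra can also draw the corresponding scree plot. A minimal example (addlabels prints the percentages on the bars):

# scree plot of the percentage of variance explained per component
fviz_eig(iris.prcomp, addlabels = TRUE)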

PCA constructs linear combinations of the original variables, the principal components, chosen so that the first component captures as much of the variance in the data as possible and each subsequent component captures the maximum remaining variance while being orthogonal to the ones before it.
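
This can be checked in base R: multiplying the centered and scaled data by the loading matrix (rotation) reproduces the component scores stored in x. A minimal sketch:

# the scores in iris.prcomp$x are linear combinations of the scaled variables
Z <- scale(iris[, -5])               # center and scale, as scale. = TRUE did
scores <- Z %*% iris.prcomp$rotation # one linear combination per component
all.equal(unname(scores), unname(iris.prcomp$x))  # should be TRUE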

PCA on mtcars

Run a PCA on another dataset that is distributed with R:

# install the wordcloud package for its textplot() function,
# which plots text labels while avoiding overlaps
install.packages("wordcloud")
library(wordcloud)

# PCA on the raw data; without scale. = TRUE, high-variance columns
# such as disp dominate the first component
data.pca <- prcomp(mtcars)

# plot first and second PCs
textplot(data.pca$x[,1],
         data.pca$x[,2],
         row.names(data.pca$x),
         cex = 0.7,
         xlim=c(-250,300),
         ylim=c(-150,75)
         )

I don't know much about cars, but at least the Merc 450s are close to each other. In addition, I guess most of the Japanese brands are on the left and the American brands on the right?
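
Because the mtcars variables are on very different scales, it is worth comparing this against a PCA on standardized data; a minimal variant of the code above (letting textplot() pick its own axis limits):

# the same analysis with standardized variables
data.pca.scaled <- prcomp(mtcars, scale. = TRUE)
textplot(data.pca.scaled$x[, 1],
         data.pca.scaled$x[, 2],
         row.names(data.pca.scaled$x),
         cex = 0.7)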
