Singular Vector Decomposition using R

In linear algebra terms, a Singular Vector Decomposition (SVD) is the decomposition of a matrix X into three matrices, each having special properties. If X is a matrix with each variable in a column and each observation in a row then the SVD is

$$!X = UDV^T$$

where the columns of U are orthogonal (left singular vectors), the columns of V are orthogonal (right singluar vectors) and D is a diagonal matrix (singular values). Here I perform a SVD on the iris dataset in R.

#use the iris dataset
#for more info type ?iris
names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

#perform hierarchical clustering
h <- hclust(dist(iris[,c(1:4)]))

#inside the h object
names(h)
[1] "merge"       "height"      "order"       "labels"      "method"      "call"       
[7] "dist.method"

#order corresponds to the hierarchical clustering order on iris rows
h$order
  [1] 108 131 103 126 130 119 106 123 118 132 110 136 141 145 125 121 144 101 137 149 116 111 148 113 140
 [26] 142 146 109 104 117 138 105 129 133 150  71 128 139 115 122 114 102 143 135 112 147 124 127  73  84
 [51] 134 120  69  88  66  76  77  55  59  78  87  51  53  86  52  57  75  98  74  79  64  92  61  99  58
 [76]  94 107  67  85  56  91  62  72  68  83  93  95 100  89  96  97  63  65  80  60  54  90  70  81  82
[101]  42  30  31  26  10  35  13   2  46  36   5  38  28  29  41   1  18  50   8  40  23   7  43   3   4
[126]  48  14   9  39  17  33  34  15  16   6  19  21  32  37  11  49  45  47  20  22  44  24  27  12  25

#order dataset by hierarchical clustering
data_ordered <- iris[h$order,]

head(data_ordered)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
108          7.3         2.9          6.3         1.8 virginica
131          7.4         2.8          6.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
126          7.2         3.2          6.0         1.8 virginica
130          7.2         3.0          5.8         1.6 virginica
119          7.7         2.6          6.9         2.3 virginica

#perform the SVD
svd1 <- svd(data_ordered[,c(1:4)])

#inside the svd1 object are the 3 separate matrices
names(svd1)
[1] "d" "u" "v"

#left singular vectors corresponding to the rows
dim(svd1$u)
[1] 150   4

#right singular vectors corresponding to the columns
dim(svd1$v)
[1] 4 4

#singular vectors
svd1$d
[1] 95.959914 17.761034  3.460931  1.884826

head(svd1$u)
           [,1]       [,2]        [,3]         [,4]
[1,] -0.1054558 0.08012812 -0.10637239 -0.134454550
[2,] -0.1049482 0.07556145 -0.12829465 -0.009698646
[3,] -0.1026729 0.07009470 -0.01813207  0.036368121
[4,] -0.1042575 0.06052312 -0.03846043 -0.125453575
[5,] -0.1020462 0.05482986 -0.11193348 -0.120558750
[6,] -0.1114810 0.11657800 -0.13510080  0.030542780

#plot all left singular vectors
par(mfrow=c(1,4))
plot(svd1$u[,1],1:150,pch=19)
plot(svd1$u[,2],1:150,pch=19)
plot(svd1$u[,3],1:150,pch=19)
plot(svd1$u[,4],1:150,pch=19)
#reset graphical parameter
par(mfrow=c(1,1))

svd_left_vector_irisThe first, second, third and fourth left singular vectors.

How do the original data points look?

#the hierarchical clustering order ordered the species nicely
table(data_ordered[ c(1:50),5])

    setosa versicolor  virginica 
         0          3         47

table(data_ordered[ c(51:100),5])

    setosa versicolor  virginica 
         0         47          3

table(data_ordered[ c(101:150),5])

    setosa versicolor  virginica 
        50          0          0

plot(data_ordered$Sepal.Length,c(1:150),pch=19,xlim=c(0,8.1),xlab="Length")
points(data_ordered$Sepal.Width,c(1:150),col=2, pch=19)
points(data_ordered$Petal.Length,c(1:150),col=3, pch=19)
points(data_ordered$Petal.Width,c(1:150),col=4, pch=19)
abline(h=50)
abline(h=100)

iris_scatterplot_all_variable_ablineEach measurement is coloured in a different colour. The hierarchical clustering order makes it easier to see different properties of different species.

Plotting the first and second left singular vectors and colouring by species.

first <- svd1$u[,1]
second <- svd1$u[,2]
species <- data_ordered$Species
species <- as.numeric(species)
first <- data.frame(first,species)
second <- data.frame(second,species)
plot(first$first, second$second, pch=19, col=first$species, xlab="First left singular vector", ylab="Second left singular vector")

first_second_left_singular_vectorIn green are virginica, red are versicolor and black are setosa.

Variance explained by the number of singular vectors

svd1$d^2/sum(svd1$d^2)
[1] 0.9653029807 0.0330689513 0.0012556535 0.0003724145
plot(svd1$d^2/sum(svd1$d^2), pch=19, xlab="Singluar vector", ylab="Variance explained")

variance_explained_irisOne singular vector is enough to explain 96.5% of the variance.

Obtaining the original matrix

#multiply each singular matrix to obtain the original
original <- svd1$u[,1:4] %*% diag(svd1$d[1:4]) %*% t(svd1$v[,1:4])
head(original)
     [,1] [,2] [,3] [,4]
[1,]  7.3  2.9  6.3  1.8
[2,]  7.4  2.8  6.1  1.9
[3,]  7.1  3.0  5.9  2.1
[4,]  7.2  3.2  6.0  1.8
[5,]  7.2  3.0  5.8  1.6
[6,]  7.7  2.6  6.9  2.3
head(data_ordered)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
108          7.3         2.9          6.3         1.8 virginica
131          7.4         2.8          6.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
126          7.2         3.2          6.0         1.8 virginica
130          7.2         3.0          5.8         1.6 virginica
119          7.7         2.6          6.9         2.3 virginica

Summary

Singular Vector Decomposition can reduce a large matrix of values into 3 separate matrices, each having special properties. The right singular vectors are actually the same as the principal components in a PCA (see this article for more information). The left singular vectors correspond to the rows of the matrix and showed that setosa are characteristic different from the other two species. The diagonal matrix D provides the amount of variance explained by the number of singular vectors.

SVD was taught in week 3 of the Data Analysis course provided by coursera. Some of the code in this post was adapted from the dimension reduction lecture; please refer to the lecture for more information.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.