Gene Ontology Graph Visualisation

After scouring the web all afternoon for a way to visualise Gene Ontology terms that I had already found to be over-represented, I finally came across a simple solution. Prior to this, I had tried several Cytoscape plugins (BiNGO, ClueGO, etc.), online web tools (REVIGO, GOrilla, WEGO, GOLEM, etc.) and others I can't be bothered listing here, but to no avail.

Now the problem with most of those tools is that one needs to perform the enrichment analysis within the tool in order to get the graphs. I had already obtained my list of enriched GO terms from an independent analysis (using GOseq). REVIGO was almost perfect for my task, except that I wanted to show all my GO terms instead of collapsing them.

So the solution to my problem was just to use AmiGO. Say these were my enriched GO terms:

GO:0009653
GO:0048731
GO:0048856
GO:0009790
GO:0003206
GO:0048646
GO:0009887
GO:0003007

First, using the "Search" function of AmiGO, paste these terms into the query box and then click submit.

In the "Term Search Results", find the "Select all" button and click it. Then in the "Perform action" box, select "create a new tree with these terms as a GraphViz image" and click "go". That's it.

You can also install GraphViz, export the dot file from AmiGO, open it with GraphViz and finally export it as a vector file, which can be turned into a publication-quality figure.
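If you have GraphViz installed, the exported dot file can also be rendered from within R by calling the command-line tool (a minimal sketch; the file names are hypothetical and "dot" is assumed to be on your PATH):

#render the dot file exported from AmiGO as an SVG (or PDF) vector file
system("dot -Tsvg amigo_tree.dot -o amigo_tree.svg")
system("dot -Tpdf amigo_tree.dot -o amigo_tree.pdf")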

[Figure: GraphViz tree of the enriched GO terms]

K-means clustering

Updated: 2014 March 13th

From Wikipedia:

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
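In other words, given observations $x_1, \dots, x_n$, k-means looks for $k$ clusters $S_1, \dots, S_k$ with cluster means $\mu_i$ that minimise the within-cluster (or "within groups") sum of squares:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

This is the same quantity used below to pick a sensible number of clusters.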

I will perform k-means clustering following the example provided at http://www.statmethods.net/advstats/cluster.html and using the wine dataset, which I previously analysed using random forests.

data <- read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                   header=F,
                   sep=',')

names(data) <- c('Class',
                 'Alcohol',
                 'Malic acid',
                 'Ash',
                 'Alcalinity of ash',
                 'Magnesium',
                 'Total phenols',
                 'Flavanoids',
                 'Nonflavanoid phenols',
                 'Proanthocyanins',
                 'Color intensity',
                 'Hue',
                 'OD280/OD315 of diluted wines',
                 'Proline')

#how many different classes?
table(data$Class)

 1  2  3 
59 71 48

#append the wine class to the row names and then remove the class column
rownames(data) <- paste(rownames(data), '_', data$Class, sep="")
data <- data[,-1]
head(data)
    Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols Flavanoids Nonflavanoid phenols
1_1   14.23       1.71 2.43              15.6       127          2.80       3.06                 0.28
2_1   13.20       1.78 2.14              11.2       100          2.65       2.76                 0.26
3_1   13.16       2.36 2.67              18.6       101          2.80       3.24                 0.30
4_1   14.37       1.95 2.50              16.8       113          3.85       3.49                 0.24
5_1   13.24       2.59 2.87              21.0       118          2.80       2.69                 0.39
6_1   14.20       1.76 2.45              15.2       112          3.27       3.39                 0.34
    Proanthocyanins Color intensity  Hue OD280/OD315 of diluted wines Proline
1_1            2.29            5.64 1.04                         3.92    1065
2_1            1.28            4.38 1.05                         3.40    1050
3_1            2.81            5.68 1.03                         3.17    1185
4_1            2.18            7.80 0.86                         3.45    1480
5_1            1.82            4.32 1.04                         2.93     735
6_1            1.97            6.75 1.05                         2.85    1450

#within groups sum of squares for a single cluster (i.e. the total variance)
wss <- (nrow(data)-1)*sum(apply(data,2,var))
#total within groups sum of squares for k = 2 to 15
#note that kmeans() uses random starting centres, so the values vary slightly between runs
for (i in 2:15)
   wss[i] <- sum(kmeans(data, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
   ylab="Within groups sum of squares")

[Figure: within groups sum of squares versus number of clusters]

Five seems to be a good number of partitions based on the within groups sum of squares.

fit <- kmeans(data, 5)
aggregate(data,by=list(fit$cluster),FUN=mean)
  Group.1  Alcohol Malic acid      Ash Alcalinity of ash Magnesium Total phenols Flavanoids
1       1 13.52750   1.925938 2.370938          17.72500 106.50000      2.725000   2.742500
2       2 12.35871   2.160000 2.240000          20.26774  90.77419      2.324516   2.114839
3       3 13.86000   1.793913 2.506957          17.07391 106.00000      2.943043   3.110870
4       4 12.94184   2.600204 2.385306          19.72245 103.10204      2.077959   1.531837
5       5 12.67860   2.758372 2.357907          21.29070  94.00000      1.854884   1.425116
  Nonflavanoid phenols Proanthocyanins Color intensity       Hue OD280/OD315 of diluted wines   Proline
1            0.2887500        1.875938        4.988750 1.0426875                     3.089062 1017.4375
2            0.3593548        1.571290        3.166129 1.0303226                     2.742581  383.4516
3            0.2986957        1.926087        6.260000 1.1000000                     3.035652 1338.5652
4            0.3908163        1.483878        5.714082 0.8728571                     2.342653  713.1020
5            0.4188372        1.335581        5.083256 0.8616279                     2.241860  529.6047

data_fit <- data.frame(data, fit$cluster)

library(cluster)
clusplot(data_fit, fit$cluster, color=T, shade=T, labels=2, lines=0, cex=0.4)

[Figure: clusplot of the five k-means clusters]
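clusplot() represents the observations using the first two principal components (or multidimensional scaling). A rough alternative sketch using prcomp() directly:

#project onto the first two principal components and colour points by cluster
pca <- prcomp(data, scale.=TRUE)
plot(pca$x[,1], pca$x[,2], col=fit$cluster, pch=19,
     xlab="PC1", ylab="PC2", main="k-means clusters on the first two principal components")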

We know that there are three classes of wine; what would our results look like if we chose three clusters?

fit <- kmeans(data, 3)
data_fit <- data.frame(data, fit$cluster)

#how many of the class 1 wines were clustered into the same cluster?
table(data_fit[grep("_1", row.names(data_fit)),]$fit.cluster)

 1  2 
13 46

#how many of the class 2 wines were clustered into the same cluster?
table(data_fit[grep("_2", row.names(data_fit)),]$fit.cluster)

 1  2  3 
20  1 50

#how many of the class 3 wines were clustered into the same cluster?
table(data_fit[grep("_3", row.names(data_fit)),]$fit.cluster)

 1  3 
29 19
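The same comparison can be made in one step by recovering the known class from the row names and cross-tabulating it against the cluster assignments (a sketch; kmeans() uses random starting centres, so set a seed if you want reproducible cluster labels):

#recover the known class from the row name suffix and cross-tabulate with the clusters
set.seed(31)
fit <- kmeans(data, 3)
known_class <- sub("^.*_", "", row.names(data))
table(known_class, fit$cluster)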

Random Forests in predicting wines

Updated 2014 September 17th to reflect changes in the R packages

Source: http://mkseo.pe.kr/stats/?p=220.

In this post we use Random Forests to classify wines derived from three different cultivars. The wine data set is available from the UCI Machine Learning Repository.

#install if necessary
install.packages("randomForest")
library(randomForest)
#randomForest 4.6-10
#Type rfNews() to see new features/changes/bug fixes.

#I've hosted this data set on my Dropbox account
#here are the steps to load it
install.packages('RCurl')
library(RCurl)

if( !file.exists("cacert.pem") ){
  download.file(url="http://curl.haxx.se/ca/cacert.pem",
                destfile = "cacert.pem")
}

my_url <- getURL("https://dl.dropboxusercontent.com/u/15251811/wine.csv",
               cainfo = "cacert.pem")

data <- read.table(textConnection(my_url), header=T, sep=",")

head(data)
#    class alcohol malic_acid  ash alcalinity_of_ash magnesium total_phenols flavanoids
#1 class_1   14.23       1.71 2.43              15.6       127          2.80       3.06
#2 class_1   13.20       1.78 2.14              11.2       100          2.65       2.76
#3 class_1   13.16       2.36 2.67              18.6       101          2.80       3.24
#4 class_1   14.37       1.95 2.50              16.8       113          3.85       3.49
#5 class_1   13.24       2.59 2.87              21.0       118          2.80       2.69
#6 class_1   14.20       1.76 2.45              15.2       112          3.27       3.39
#  nonflavanoid_phenols proanthocyanins colour_intensity  hue od280_od315 proline
#1                 0.28            2.29             5.64 1.04        3.92    1065
#2                 0.26            1.28             4.38 1.05        3.40    1050
#3                 0.30            2.81             5.68 1.03        3.17    1185
#4                 0.24            2.18             7.80 0.86        3.45    1480
#5                 0.39            1.82             4.32 1.04        2.93     735
#6                 0.34            1.97             6.75 1.05        2.85    1450

table(data$class)

#class_1 class_2 class_3 
#     59      71      48

#hold out the first ten wines of each class as a test set
test <- data[ c(1:10, 60:69, 131:140), ]
#train on the remaining 148 wines
train <- data[ c(11:59, 70:130, 141:178), ]

r <- randomForest(class ~ ., data=train, importance=TRUE, do.trace=100)
#ntree      OOB      1      2      3
#  100:   2.70%  2.04%  4.92%  0.00%
#  200:   1.35%  0.00%  3.28%  0.00%
#  300:   1.35%  0.00%  3.28%  0.00%
#  400:   1.35%  0.00%  3.28%  0.00%
#  500:   1.35%  0.00%  3.28%  0.00%

r

#Call:
# randomForest(formula = class ~ ., data = train, importance = TRUE,      do.trace = 100) 
#               Type of random forest: classification
#                     Number of trees: 500
#No. of variables tried at each split: 3
#
#        OOB estimate of  error rate: 1.35%
#Confusion matrix:
#        class_1 class_2 class_3 class.error
#class_1      49       0       0  0.00000000
#class_2       1      59       1  0.03278689
#class_3       0       0      38  0.00000000

names(r)
# [1] "call"            "type"            "predicted"       "err.rate"        "confusion"      
# [6] "votes"           "oob.times"       "classes"         "importance"      "importanceSD"   
#[11] "localImportance" "proximity"       "ntree"           "mtry"            "forest"         
#[16] "y"               "test"            "inbag"           "terms"

r$call
#randomForest(formula = class ~ ., data = train, importance = TRUE, 
#    do.trace = 100)

r$type
#[1] "classification"

head(r$predicted)
#     11      12      13      14      15      16 
#class_1 class_1 class_1 class_1 class_1 class_1 
#Levels: class_1 class_2 class_3

# two incorrect predictions
# as seen above
table(r$predicted == train$class)
#
#FALSE  TRUE 
#    2   146

head(r$err.rate)
#            OOB    class_1    class_2    class_3
#[1,] 0.10869565 0.12500000 0.10526316 0.09090909
#[2,] 0.06741573 0.05882353 0.05555556 0.10526316
#[3,] 0.07142857 0.05263158 0.06250000 0.11538462
#[4,] 0.11200000 0.02439024 0.09803922 0.24242424
#[5,] 0.10447761 0.02272727 0.12727273 0.17142857
#[6,] 0.11029412 0.04444444 0.10909091 0.19444444
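The per-tree error rates stored in r$err.rate can also be plotted against the number of trees; the randomForest package provides a plot() method for this:

#plot OOB and per-class error rates versus the number of trees
plot(r, main="Error rate versus number of trees")
#add a legend matching the columns of r$err.rate (line types/colours follow matplot defaults)
legend("topright", legend=colnames(r$err.rate), col=1:4, lty=1:4)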

r$confusion
#        class_1 class_2 class_3 class.error
#class_1      49       0       0  0.00000000
#class_2       1      59       1  0.03278689
#class_3       0       0      38  0.00000000

head(r$votes)
#     class_1     class_2     class_3
#11 0.9905660 0.009433962 0.000000000
#12 0.8409091 0.119318182 0.039772727
#13 0.9285714 0.060439560 0.010989011
#14 0.9891304 0.005434783 0.005434783
#15 0.9940828 0.005917160 0.000000000
#16 1.0000000 0.000000000 0.000000000
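The predicted class of each training sample is simply the class with the largest out-of-bag vote share, so the rows of r$votes above line up with the entries of r$predicted; a quick sanity check:

#class with the most OOB votes for each sample (ties are broken at random by max.col)
head(r$classes[max.col(r$votes)])
#compare with the stored predictions
head(as.character(r$predicted))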

head(r$oob.times)
#[1] 170 201 168 182 182 165

r$classes
#[1] "class_1" "class_2" "class_3"

head(r$importance)
#                      class_1       class_2     class_3 MeanDecreaseAccuracy
#alcohol           0.153752764  0.1018091391 0.024288014          0.097357260
#malic_acid        0.004476793  0.0020282530 0.022202767          0.007776722
#ash               0.001933002  0.0025004258 0.003690918          0.002475941
#alcalinity_of_ash 0.019837640  0.0016648544 0.018789050          0.011811606
#magnesium         0.046735362  0.0094560643 0.003287392          0.019929169
#total_phenols     0.074401796 -0.0005645205 0.034750325          0.033061689
#                  MeanDecreaseGini
#alcohol                 14.5954250
#malic_acid               2.1264398
#ash                      0.8337735
#alcalinity_of_ash        3.0953357
#magnesium                3.0854575
#total_phenols            5.3164194

head(r$importanceSD)
#                      class_1      class_2     class_3 MeanDecreaseAccuracy
#alcohol           0.008291376 0.0056877344 0.003221596         0.0045790582
#malic_acid        0.001463631 0.0008971929 0.003525574         0.0011496141
#ash               0.000866467 0.0009069961 0.001087754         0.0006250729
#alcalinity_of_ash 0.002586450 0.0011787330 0.003161124         0.0013858065
#magnesium         0.004558269 0.0014140800 0.001497403         0.0018144971
#total_phenols     0.006585248 0.0011172624 0.004056111         0.0027482817
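These importance measures can be visualised with varImpPlot() from the randomForest package:

#dot charts of the mean decrease in accuracy and mean decrease in Gini for each variable
varImpPlot(r)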

r$localImportance
#NULL

r$proximity
#NULL

r$ntree
#[1] 500

r$mtry
#[1] 3

names(r$forest)
# [1] "ndbigtree"  "nodestatus" "bestvar"    "treemap"    "nodepred"   "xbestsplit"
# [7] "pid"        "cutoff"     "ncat"       "maxcat"     "nrnodes"    "ntree"     
#[13] "nclass"     "xlevels"

head(r$y)
#     11      12      13      14      15      16 
#class_1 class_1 class_1 class_1 class_1 class_1 
#Levels: class_1 class_2 class_3

r$test
#NULL

r$inbag
#NULL
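The 30 wines held out in the test set earlier have not been used yet; a minimal sketch of classifying them with the fitted forest:

#predict the class of the held out wines and tabulate against the known classes
prediction <- predict(r, test)
table(observed = test$class, predicted = prediction)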


Kobe Bryant and the 2012 Lakers

Kobe Bryant and the Lakers (11-14) aren't doing as well as I had expected, given the players they acquired in the off-season. Everyone likes to point out that when he scores over a certain number of points (e.g. 30), the Lakers have lost more often than they have won. So I took his stats for this season and had a look.

Using Random Forests™, we can assess the importance of each predictor variable (his stats) in predicting the outcome of a game (win/loss).
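A rough sketch of the idea, using simulated stand-in numbers rather than his actual game stats (the data frame and column names below are made up purely for illustration):

#hypothetical stand-in data: one row per game with a win/loss outcome and some stats
set.seed(123)
kobe <- data.frame(
  points   = rpois(25, 28),
  rebounds = rpois(25, 5),
  assists  = rpois(25, 5),
  outcome  = factor(sample(c("win", "loss"), 25, replace=TRUE))
)
library(randomForest)
fit <- randomForest(outcome ~ ., data=kobe, importance=TRUE)
#mean decrease in accuracy for each stat
importance(fit, type=1)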


Explaining PCA to a schoolchild

Ed Yong asked on Twitter “Explain principal component analysis to a schoolchild in a tweet.” Since I can’t explain PCA eloquently, I found this interesting and wanted to keep a record of the replies for future reference. Here are some of the modified replies, with my favourite first (and the rest in no particular order):

• Something nerds do to show off and that the rest of us secretly don't understand but always look knowingly when it's mentioned. @richboden
• Compare the height, eye color, hair length, etc. of the people around you. Where do you notice the most difference? @holtchesley
• It's a way to take any clues or observations you have about something and use math to see how much you can figure out about it. @ShipLives
• Capturing the essence of a data set in a reduced form. Think of describing the premiership in terms of Man U, Chelsea and Arsenal @fastconvergence
• Imagine lining the crayons in a box up based on their color similarity. Now, in the other direction, separate them by the wear. @APV2600
• "If you have to talk about something really complicated but you can only use two or three numbers, what do you do?" @blakestacey
• PCA is like picking cookies with the most chocolate chips and seeing where they came from in the jar. @holy_kau
• If you have a jellybean-shaped cloud of points, PCA figures out long side and short side of cloud and squishes it to get a ball. @geomblog
• PCA is a way of finding basic similarities between many different things - linking complexities by their simplest similarities @aimeemax
• PCA is a math method that converts data into a map. Objects (countries) that are closeby have similar parameters (coordinates) @scimomof2
• It's like representing a road with a line on a map. The road has width and length, but the length is sufficient for most purposes @CaldenWloka
• PCA is like taking all your responses to 'what is PCA' and condensing their essence into a single retweet. @ErikJCox
• A colleague once summed up PCA as a way of letting data decide for itself which patterns best describe it. @J_Liptak
• PCA tells you which features best describe the differences between items in a group, like colour in a box of lollies @pickleswarlz
• PCA of leaf shape would first give size, then fatness (width:length), wiggles of leaf edge, etc. 1st parts give largest variation @mickresearch
• Imagine collecting all the rocks in the yard, but you can only keep 10. PCA picks the ones with the most common color & sizes @kristenobacter
• Best soccer teams: do they score more or have the best defense ? PCA ranks values of teams based on multiple data of this kind. @tomroud