Analysing miRNA expression in cancers

MiRNAs are a class of small RNAs that typically down-regulate the expression of their target transcripts, binding to them and causing them to be degraded or inhibiting their translation. There has been a lot of interest in studying the expression patterns of miRNAs, especially in relation to cancer, since their mis-regulation and aberrant expression may cause aberrant expression of other transcripts. If, for example, a miRNA regulating an oncogene (a gene that has the potential to promote cancer when overexpressed) is underexpressed, the oncogene may become overactive and switch on cellular processes that promote cancer. On the other hand, if a miRNA is aberrantly overexpressed and its target is a gene that protects us from the over-proliferation of cells, i.e. a tumour suppressor gene, then we lose this protective ability and a cancer may form.

There are thousands of known miRNAs, and the aberrant expression of a specific miRNA may have different consequences; in the context of cancer, it may promote cancer in a different tissue. Thus, to investigate which miRNAs are mis-expressed in various cancers, I downloaded five miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples), Bladder Urothelial Carcinoma (208 samples), Brain Lower Grade Glioma (299 samples), Liver hepatocellular carcinoma (186 samples) and Lung adenocarcinoma (482 samples).


Sequence composition and random forests

Updated: 2013 November 28th

The sequence composition, i.e. the nucleotide composition, at transcription start sites (TSSs) of mRNAs is biased: certain nucleotides are preferred. Here I examine the sequence composition at the TSSs of transcripts from the NCBI Reference Sequence Database (RefSeq) and use random forests to see whether it's possible to train a classifier that can distinguish TSSs from random sequences. This work is done entirely in R, and all code is hosted on GitHub Gist.

Firstly, let's download the entire collection of RefSeq transcripts using the R Bioconductor package biomaRt and create a data frame with the TSSs. Note that when working with transcripts, use the attributes 'transcript_start' and 'transcript_end', not 'start_position' and 'end_position'; the latter will give you the Ensembl gene coordinates.
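
As a rough sketch of what such a query might look like (my own reconstruction, not the post's exact code, which is on GitHub Gist; the attribute names follow Ensembl BioMart conventions and may differ between releases):

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

library(biomaRt)
#use the human Ensembl dataset
mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
#fetch transcript coordinates together with RefSeq mRNA identifiers
refseq <- getBM(attributes=c("refseq_mrna",
                             "chromosome_name",
                             "transcript_start",
                             "transcript_end",
                             "strand"),
                mart=mart)
#keep only transcripts that have a RefSeq mRNA identifier
refseq <- refseq[refseq$refseq_mrna != "", ]
#the TSS is the transcript start on the plus strand
#and the transcript end on the minus strand
refseq$tss <- ifelse(refseq$strand == 1,
                     refseq$transcript_start,
                     refseq$transcript_end)
head(refseq)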


Predicting cancer

So far I've come across four machine learning methods: random forests, classification trees, hierarchical clustering and k-means clustering. Here I use all four of these methods (plus SVMs) to predict cancer, or more specifically malignant tumours, using the Wisconsin breast cancer dataset.

wget http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
cat breast-cancer-wisconsin.data | head -5
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
wget http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
cat breast-cancer-wisconsin.names | tail -25 | head -16
7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)
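
As a minimal sketch (my addition, not from the original post) of loading this file in R, with column names based on the attribute list above; note that the dataset codes missing values as '?':

#read the comma-separated file; '?' marks missing values
data <- read.csv("breast-cancer-wisconsin.data", header=FALSE, na.strings="?")
names(data) <- c("id", "clump_thickness", "cell_size_uniformity",
                 "cell_shape_uniformity", "marginal_adhesion",
                 "epithelial_cell_size", "bare_nuclei", "bland_chromatin",
                 "normal_nucleoli", "mitoses", "class")
#class: 2 = benign, 4 = malignant
table(data$class)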


Building a classification tree in R

In week 6 of the Data Analysis course offered freely on Coursera, there was a lecture on building classification trees (also known as decision trees) in R. I thoroughly enjoyed the lecture, and here I reiterate what was taught, both to reinforce my memory and for sharing purposes.

I will jump straight into building a classification tree in R and explain the concepts along the way. We will use the iris dataset, which gives measurements in centimeters of the variables sepal length and width, and petal length and width, respectively, for 50 flowers from three different species of iris.

data(iris)
names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"   
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50
#install if necessary
install.packages("ggplot2")
library(ggplot2)
qplot(Petal.Width, Sepal.Width, data=iris, colour=Species, size=I(4))
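
The excerpt ends before the tree itself is built; as a rough sketch (assuming the rpart package, one common choice, rather than whichever package the course lecture used), fitting and plotting a classification tree on iris might look like this:

#install if necessary
install.packages("rpart")
library(rpart)
#fit a classification tree predicting Species from the four measurements
fit <- rpart(Species ~ ., data=iris, method="class")
print(fit)
#plot the tree and label the nodes with class counts
plot(fit, margin=0.1)
text(fit, use.n=TRUE)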


K-means clustering

Updated: 2014 March 13th

From Wikipedia:

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

I will perform k-means clustering following the example provided at http://www.statmethods.net/advstats/cluster.html and using the wine dataset, which I previously analysed using random forests.

data <- read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                   header=F,
                   sep=',')

names(data) <- c('Class',
                 'Alcohol',
                 'Malic acid',
                 'Ash',
                 'Alcalinity of ash',
                 'Magnesium',
                 'Total phenols',
                 'Flavanoids',
                 'Nonflavanoid phenols',
                 'Proanthocyanins',
                 'Color intensity',
                 'Hue',
                 'OD280/OD315 of diluted wines',
                 'Proline')

#how many different classes?
table(data$Class)

 1  2  3 
59 71 48

#append the wine class to each row name, then drop the class column
rownames(data) <- paste(rownames(data), '_', data$Class, sep="")
data <- data[,-1]
head(data)
    Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols Flavanoids Nonflavanoid phenols
1_1   14.23       1.71 2.43              15.6       127          2.80       3.06                 0.28
2_1   13.20       1.78 2.14              11.2       100          2.65       2.76                 0.26
3_1   13.16       2.36 2.67              18.6       101          2.80       3.24                 0.30
4_1   14.37       1.95 2.50              16.8       113          3.85       3.49                 0.24
5_1   13.24       2.59 2.87              21.0       118          2.80       2.69                 0.39
6_1   14.20       1.76 2.45              15.2       112          3.27       3.39                 0.34
    Proanthocyanins Color intensity  Hue OD280/OD315 of diluted wines Proline
1_1            2.29            5.64 1.04                         3.92    1065
2_1            1.28            4.38 1.05                         3.40    1050
3_1            2.81            5.68 1.03                         3.17    1185
4_1            2.18            7.80 0.86                         3.45    1480
5_1            1.82            4.32 1.04                         2.93     735
6_1            1.97            6.75 1.05                         2.85    1450

#within-groups sum of squares for k = 1 to 15;
#for k = 1 this is (n - 1) times the sum of the column variances
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:15)
   wss[i] <- sum(kmeans(data, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
   ylab="Within groups sum of squares")

[Figure: within-groups sum of squares versus number of clusters]

Five seems to be a good number of partitions based on the within-groups sum of squares.
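
Note that kmeans is initialised with randomly chosen centers, so the cluster numbering (and sometimes the membership) can differ between runs; setting a seed beforehand (my addition, not part of the original example) makes the results reproducible.

#make the random initialisation of kmeans reproducible
set.seed(1234)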

fit <- kmeans(data, 5)
#the mean of each variable within each cluster
aggregate(data,by=list(fit$cluster),FUN=mean)
  Group.1  Alcohol Malic acid      Ash Alcalinity of ash Magnesium Total phenols Flavanoids
1       1 13.52750   1.925938 2.370938          17.72500 106.50000      2.725000   2.742500
2       2 12.35871   2.160000 2.240000          20.26774  90.77419      2.324516   2.114839
3       3 13.86000   1.793913 2.506957          17.07391 106.00000      2.943043   3.110870
4       4 12.94184   2.600204 2.385306          19.72245 103.10204      2.077959   1.531837
5       5 12.67860   2.758372 2.357907          21.29070  94.00000      1.854884   1.425116
  Nonflavanoid phenols Proanthocyanins Color intensity       Hue OD280/OD315 of diluted wines   Proline
1            0.2887500        1.875938        4.988750 1.0426875                     3.089062 1017.4375
2            0.3593548        1.571290        3.166129 1.0303226                     2.742581  383.4516
3            0.2986957        1.926087        6.260000 1.1000000                     3.035652 1338.5652
4            0.3908163        1.483878        5.714082 0.8728571                     2.342653  713.1020
5            0.4188372        1.335581        5.083256 0.8616279                     2.241860  529.6047

data_fit <- data.frame(data, fit$cluster)

library(cluster)
clusplot(data_fit, fit$cluster, color=T, shade=T, labels=2, lines=0, cex=0.4)

[Figure: clusplot of the five k-means clusters]

We know that there are three classes of wines; what would our results look like if we chose three clusters?

fit <- kmeans(data, 3)
data_fit <- data.frame(data, fit$cluster)

#how were the class 1 wines distributed across the clusters?
table(data_fit[grep("_1", row.names(data_fit)),]$fit.cluster)

 1  2 
13 46

#how were the class 2 wines distributed across the clusters?
table(data_fit[grep("_2", row.names(data_fit)),]$fit.cluster)

 1  2  3 
20  1 50

#how were the class 3 wines distributed across the clusters?
table(data_fit[grep("_3", row.names(data_fit)),]$fit.cluster)

 1  3 
29 19

Most of the class 1 and class 2 wines each fall into a single cluster, but the class 3 wines are split roughly evenly between two clusters, so the three clusters only partially recover the known classes.

Random Forests in predicting wines

Updated: 2014 September 17th to reflect changes in the R packages

Source: http://mkseo.pe.kr/stats/?p=220

Here I use random forests to predict the class of wines derived from three different cultivars. Download the wine data set from the Machine Learning Repository.

#install if necessary
install.packages("randomForest")
library(randomForest)
#randomForest 4.6-10
#Type rfNews() to see new features/changes/bug fixes.

#I've hosted this data set on my Dropbox account
#here are the steps to load it
install.packages('RCurl')
library(RCurl)

if( !file.exists("cacert.pem") ){
  download.file(url="http://curl.haxx.se/ca/cacert.pem",
                destfile = "cacert.pem")
}

my_url <- getURL("https://dl.dropboxusercontent.com/u/15251811/wine.csv",
               cainfo = "cacert.pem")

data <- read.table(textConnection(my_url), header=T, sep=",")

head(data)
#    class alcohol malic_acid  ash alcalinity_of_ash magnesium total_phenols flavanoids
#1 class_1   14.23       1.71 2.43              15.6       127          2.80       3.06
#2 class_1   13.20       1.78 2.14              11.2       100          2.65       2.76
#3 class_1   13.16       2.36 2.67              18.6       101          2.80       3.24
#4 class_1   14.37       1.95 2.50              16.8       113          3.85       3.49
#5 class_1   13.24       2.59 2.87              21.0       118          2.80       2.69
#6 class_1   14.20       1.76 2.45              15.2       112          3.27       3.39
#  nonflavanoid_phenols proanthocyanins colour_intensity  hue od280_od315 proline
#1                 0.28            2.29             5.64 1.04        3.92    1065
#2                 0.26            1.28             4.38 1.05        3.40    1050
#3                 0.30            2.81             5.68 1.03        3.17    1185
#4                 0.24            2.18             7.80 0.86        3.45    1480
#5                 0.39            1.82             4.32 1.04        2.93     735
#6                 0.34            1.97             6.75 1.05        2.85    1450

table(data$class)

#class_1 class_2 class_3 
#     59      71      48

#hold out ten wines of each class as a test set and train on the rest
test <- data[ c(1:10, 60:69, 131:140), ]
train <- data[ c(11:59, 70:130, 141:178), ]

r <- randomForest(class ~ ., data=train, importance=TRUE, do.trace=100)
#ntree      OOB      1      2      3
#  100:   2.70%  2.04%  4.92%  0.00%
#  200:   1.35%  0.00%  3.28%  0.00%
#  300:   1.35%  0.00%  3.28%  0.00%
#  400:   1.35%  0.00%  3.28%  0.00%
#  500:   1.35%  0.00%  3.28%  0.00%

r

#Call:
# randomForest(formula = class ~ ., data = train, importance = TRUE,      do.trace = 100) 
#               Type of random forest: classification
#                     Number of trees: 500
#No. of variables tried at each split: 3
#
#        OOB estimate of  error rate: 1.35%
#Confusion matrix:
#        class_1 class_2 class_3 class.error
#class_1      49       0       0  0.00000000
#class_2       1      59       1  0.03278689
#class_3       0       0      38  0.00000000

names(r)
# [1] "call"            "type"            "predicted"       "err.rate"        "confusion"      
# [6] "votes"           "oob.times"       "classes"         "importance"      "importanceSD"   
#[11] "localImportance" "proximity"       "ntree"           "mtry"            "forest"         
#[16] "y"               "test"            "inbag"           "terms"

r$call
#randomForest(formula = class ~ ., data = train, importance = TRUE, 
#    do.trace = 100)

r$type
#[1] "classification"

head(r$predicted)
#     11      12      13      14      15      16 
#class_1 class_1 class_1 class_1 class_1 class_1 
#Levels: class_1 class_2 class_3

# two incorrect predictions
# as seen above
table(r$predicted == train$class)
#
#FALSE  TRUE 
#    2   146

head(r$err.rate)
#            OOB    class_1    class_2    class_3
#[1,] 0.10869565 0.12500000 0.10526316 0.09090909
#[2,] 0.06741573 0.05882353 0.05555556 0.10526316
#[3,] 0.07142857 0.05263158 0.06250000 0.11538462
#[4,] 0.11200000 0.02439024 0.09803922 0.24242424
#[5,] 0.10447761 0.02272727 0.12727273 0.17142857
#[6,] 0.11029412 0.04444444 0.10909091 0.19444444

r$confusion
#        class_1 class_2 class_3 class.error
#class_1      49       0       0  0.00000000
#class_2       1      59       1  0.03278689
#class_3       0       0      38  0.00000000

head(r$votes)
#     class_1     class_2     class_3
#11 0.9905660 0.009433962 0.000000000
#12 0.8409091 0.119318182 0.039772727
#13 0.9285714 0.060439560 0.010989011
#14 0.9891304 0.005434783 0.005434783
#15 0.9940828 0.005917160 0.000000000
#16 1.0000000 0.000000000 0.000000000

head(r$oob.times)
#[1] 170 201 168 182 182 165

r$classes
#[1] "class_1" "class_2" "class_3"

head(r$importance)
#                      class_1       class_2     class_3 MeanDecreaseAccuracy
#alcohol           0.153752764  0.1018091391 0.024288014          0.097357260
#malic_acid        0.004476793  0.0020282530 0.022202767          0.007776722
#ash               0.001933002  0.0025004258 0.003690918          0.002475941
#alcalinity_of_ash 0.019837640  0.0016648544 0.018789050          0.011811606
#magnesium         0.046735362  0.0094560643 0.003287392          0.019929169
#total_phenols     0.074401796 -0.0005645205 0.034750325          0.033061689
#                  MeanDecreaseGini
#alcohol                 14.5954250
#malic_acid               2.1264398
#ash                      0.8337735
#alcalinity_of_ash        3.0953357
#magnesium                3.0854575
#total_phenols            5.3164194
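
To visualise these importance measures, the randomForest package provides varImpPlot (not part of the original excerpt):

#dot charts of the mean decrease in accuracy and mean decrease in Gini
varImpPlot(r)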

head(r$importanceSD)
#                      class_1      class_2     class_3 MeanDecreaseAccuracy
#alcohol           0.008291376 0.0056877344 0.003221596         0.0045790582
#malic_acid        0.001463631 0.0008971929 0.003525574         0.0011496141
#ash               0.000866467 0.0009069961 0.001087754         0.0006250729
#alcalinity_of_ash 0.002586450 0.0011787330 0.003161124         0.0013858065
#magnesium         0.004558269 0.0014140800 0.001497403         0.0018144971
#total_phenols     0.006585248 0.0011172624 0.004056111         0.0027482817

r$localImportance
#NULL

r$proximity
#NULL

r$ntree
#[1] 500

r$mtry
#[1] 3

names(r$forest)
# [1] "ndbigtree"  "nodestatus" "bestvar"    "treemap"    "nodepred"   "xbestsplit"
# [7] "pid"        "cutoff"     "ncat"       "maxcat"     "nrnodes"    "ntree"     
#[13] "nclass"     "xlevels"

head(r$y)
#     11      12      13      14      15      16 
#class_1 class_1 class_1 class_1 class_1 class_1 
#Levels: class_1 class_2 class_3

r$test
#NULL

r$inbag
#NULL
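
We set a test set aside earlier but the excerpt never uses it; as a quick sketch (my addition, not from the original post), the trained forest can be evaluated on the held-out wines with predict:

#predict the class of the held-out wines and compare with the truth
predictions <- predict(r, newdata=test)
table(predictions, test$class)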


Visualising hierarchical clustering results

I've written about hierarchical clustering before as an attempt to understand it better. Within R, you can plot hierarchical clustering results; however, when working with a large dataset, you may produce plots like these, where all the labels overlap:

[Figures: two dendrograms with overlapping, unreadable labels]

None of the labels are legible. During my honours year I worked in a molecular evolution lab, where I learned about the Newick format, so I wondered whether there was a way to export the hierarchical clustering results into Newick format; I could then use TreeView or some other tree-visualising program to view and manipulate the results. After some searching, I found this post from the Getting Genetics Done blog, which was exactly what I wanted.

Following almost exactly the instructions from that post:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("ctc")

library(ctc)
#hierarchical clustering of the rows using average linkage
#on a Pearson correlation distance
d <- as.dist(1 - cor(t(data), method="pearson"))
hc <- hclust(d, method="average")
write.table(hc2Newick(hc), file="hc.newick")

Now open hc.newick in your text editor of choice (e.g. Vim) and delete the first line, which should be "x" (with the double quotation marks). The second line should start with "1" " (again with the double quotation marks); remove those characters so that the line now begins with an open parenthesis, i.e. (. Lastly, go to the end of this line and remove the final double quotation mark, i.e. ", so that the line ends with a semicolon, i.e. ;.
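
Alternatively (my own suggestion, not from the posts above), you can skip the manual editing entirely by writing the bare Newick string with writeLines, which adds none of the quoting or row and column names that write.table does:

#writeLines outputs the plain string, so no clean-up is needed
writeLines(hc2Newick(hc), con="hc.newick")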

Open your tree-viewing program of choice, for example Dendroscope, and open the hc.newick file. I prefer this program over TreeView because it allows you to zoom right into the tree (via mouse scrolling) and to copy node information to the clipboard (among many other features). Within Dendroscope, you can try different types of layouts, such as a circular dendrogram.

Colouring the branches

I learned of the sparcl package from this Stack Overflow post. Here, using the iris dataset, which comes with R, I colour the branches according to species.

#install sparcl package
install.packages("sparcl")
#load package
library(sparcl)
names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
hh <- hclust(dist(iris[,c(1:4)]))
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50
#colour the dendrogram branches by species
species_as_numeric <- as.numeric(iris$Species)
ColorDendrogram(hh, y=species_as_numeric, branchlength=2)

[Figure: dendrogram with branches coloured by species]
The three species, setosa, versicolor and virginica, mostly cluster together based on their sepal length, sepal width, petal length and petal width.