# Analysing miRNA expression in cancers

MiRNAs are a class of small RNAs that when expressed usually down regulates the expression of its target transcript by binding to it and causing it to degrade or inhibiting it from being translated. There has been a lot of interest in studying the expression pattern of miRNAs, especially in relation to cancer, since their mis-regulation and aberrant expression, may cause aberrant expression of other transcripts. If for example, a miRNA regulating a oncogene (a gene that has the potential to promoter cancer when over expressed) is under expressed, then the oncogene may become over active and switch on cellular processes that promoter cancer. On the other hand, if a miRNA is aberrantly over expressed and its target is a gene that protects us from the over proliferation of cells, i.e. a tumour suppressor gene, then we lose this protective ability, from which a cancer may form.

There are thousands of known miRNAs and the aberrant expression of a specific miRNA may have different consequences and in the context of cancers, it may cause cancer in a different tissue. Thus to investigate which miRNAs are mis-expressed in various cancers, I downloaded 5 miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples), Bladder Urothelial Carcinoma (208 samples), Brain Lower Grade Glioma (299 samples), Liver hepatocellular carcinoma (186 samples) and Lung adenocarcinoma (482 samples).

The normalised expression (reads per million) of a panel of miRNAs is given for every sample. In this analysis, I looked at the expression of 1,072 miRNAs across 1,363 samples from 5 different classes of cancers.

After downloading the datasets, there will be a bunch of *.mirna.quantification.txt files that contain the expression of miRNAs in a particular sample. I saved each class of cancer, into 5 directories, namely aml, blca, lgg, lihc and luad. Now to load the data into R:

#set the directory to the *.mirna.quantification.txt files for aml
setwd("aml/miRNASeq/BCGSC__IlluminaGA_miRNASeq/Level_3/")
aml_file <- list.files('.', pattern="*mirna.quantification.txt")
[1] "TCGA-AB-2802-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2803-03A-01T-0734-13.mirna.quantification.txt"
[3] "TCGA-AB-2805-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2806-03A-01T-0734-13.mirna.quantification.txt"
[5] "TCGA-AB-2807-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2808-03A-01T-0734-13.mirna.quantification.txt"

#merging multiple files: http://novicemetrics.blogspot.jp/2011/04/merging-multiple-data-files-into-one.html
#colClasses NULL trick to skip those columns
#there are warnings because the columns names are the same
aml_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, aml_data_list)
dim(aml_data)
[1] 705 189

aml_data[1,1:3]
1 hsa-let-7a-1                         9239.468                         10699.25
#make miRNA_ID the row.names
row.names(aml_data) <- aml_data$miRNA_ID #remove the first column aml_data <- aml_data[,-1] aml_data[1,1:3] reads_per_million_miRNA_mapped.x reads_per_million_miRNA_mapped.y reads_per_million_miRNA_mapped.x.1 hsa-let-7a-1 9239.468 10699.25 6724.519 #rename the column names aml_header <- aml_file #-13.mirna.quantification.txt is redundant aml_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', aml_header) names(aml_data) <- aml_header aml_data[1:3,1:3] TCGA-AB-2802-03A-01T-0734 TCGA-AB-2803-03A-01T-0734 TCGA-AB-2805-03A-01T-0734 hsa-let-7a-1 9239.468 10699.25 6724.519 hsa-let-7a-2 17917.031 21746.64 13272.049 hsa-let-7a-3 9302.053 10589.12 6754.553 #705 miRNAs in 188 samples dim(aml_data) [1] 705 188 #now load the other 4 cancers: blca lgg lihc luad setwd("blca/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/") blca_file <- list.files('.', pattern="*mirna.quantification.txt") blca_data_list <- lapply(blca_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)}) blca_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, blca_data_list) row.names(blca_data) <- blca_data$miRNA_ID
blca_data <- blca_data[,-1]
dim(blca_data)
[1] 1046  208

setwd("lihc/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/")
lihc_file <- list.files('.', pattern="*mirna.quantification.txt")
lihc_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, lihc_data_list)
row.names(lihc_data) <- lihc_data$miRNA_ID lihc_data <- lihc_data[,-1] lihc_header <- lihc_file lihc_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', lihc_header) names(lihc_data) <- lihc_header dim(lihc_data) [1] 1046 186 setwd("lgg/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/") lgg_file <- list.files('.', pattern="*mirna.quantification.txt") lgg_data_list <- lapply(lgg_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)}) lgg_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, lgg_data_list) row.names(lgg_data) <- lgg_data$miRNA_ID
lgg_data <- lgg_data[,-1]
dim(lgg_data)
[1] 1046  299

row.names(luad_data) <- luad_data$miRNA_ID luad_data <- luad_data[,-1] luad_header <- luad_file luad_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', luad_header) names(luad_data) <- luad_header dim(luad_data) [1] 1046 482  Now that we have 5 individual data frames for each class of cancer, let's merge them together into one big table: #merge all datasets 1 by 1 #aml blca lgg lihc luad data <- merge(aml_data, blca_data, by="row.names", all=T) #after each merge a new column called Row.names is created #store Row.names as the row.names and remove the column row.names(data) <- data$Row.names
data <- data[,-1]

#lgg
data <- merge(data, lgg_data, by="row.names", all=T)
row.names(data) <- data$Row.names data <- data[,-1] #lihc data <- merge(data, lihc_data, by="row.names", all=T) row.names(data) <- data$Row.names
data <- data[,-1]

data <- merge(data, luad_data, by="row.names", all=T)
row.names(data) <- data$Row.names data <- data[,-1] dim(data) [1] 1071 1363 #sanity check #aml blca lgg lihc luad 188 + 208 + 299 + 186 + 482 [1] 1363  So we have merged the 5 datasets into one big data frame called data, which contains 1071 rows (miRNAs) and 1363 columns (samples). I will use random forests to find which variables (miRNAs) are important in classifying the cancers into their respective classes. #transpose dataset data <- t(data) data <- as.data.frame(data) #add a type column to data #and label rows to their cancer type data$type <- c(rep('aml',188), rep('blca',208), rep('lgg',299), rep('lihc',186), rep('luad',482))
#make into factor
data$type <- as.factor(data$type)
#make all NAs into zero
data[is.na(data)] <- 0
#replace hyphens
names(data) <- gsub(pattern='-', replacement='_', x=names(data))

#check the data frame
data[c(1,189,397,696,1363),1070:1072]
hsa_mir_99a hsa_mir_99b type
TCGA-AB-2802-03A-01T-0734   209.52409     1993.20  aml
TCGA-BL-A0C8-01A-11R-A10V    33.83794    36105.35 blca
TCGA-CS-4938-01B-11R-1895  8569.32665    56089.27  lgg
TCGA-BC-4072-01B-11R-A154   644.22746    21531.59 lihc

#install the randomForest package if necessary
install.packages("randomForest")
library(randomForest)
r <- randomForest(type ~ ., data=data, importance=TRUE, do.trace=100)
r

ntree      OOB      1      2      3      4      5
100:   1.39%  0.00%  3.37%  0.33%  1.61%  1.66%
200:   1.39%  0.00%  3.85%  0.33%  1.08%  1.66%
300:   1.17%  0.00%  2.40%  0.33%  1.08%  1.66%
400:   1.03%  0.00%  2.40%  0.33%  1.08%  1.24%
500:   0.88%  0.00%  2.40%  0.33%  1.08%  0.83%

r

Call:
randomForest(formula = type ~ ., data = data, importance = TRUE,      do.trace = 100)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 32

OOB estimate of  error rate: 0.88%
Confusion matrix:
aml blca lgg lihc luad class.error
aml  188    0   0    0    0 0.000000000
blca   0  203   1    0    4 0.024038462
lgg    0    0 298    1    0 0.003344482
lihc   0    1   0  184    1 0.010752688
luad   0    4   0    0  478 0.008298755

#install if necessary
install.packages("pROC")
library(pROC)
roc(r$y, r$votes)

Call:
roc.default(response = r$y, predictor = r$votes)

Data: r$votes in 940 controls (r$y aml) < 1040 cases (r$y blca). Area under the curve: 0.6578 rn <- round(importance(r), 2) #order by MeanDecreaseAccuracy head(rn[order(rn[,6], decreasing=TRUE),],10) aml blca lgg lihc luad MeanDecreaseAccuracy MeanDecreaseGini hsa_mir_375 3.88 12.00 4.81 3.37 13.52 13.49 29.13 hsa_mir_205 2.46 11.01 3.49 6.33 7.69 11.53 15.65 hsa_mir_10b 1.98 9.76 0.88 1.68 8.87 11.39 11.02 hsa_mir_122 2.53 7.76 3.85 9.92 8.79 10.85 16.73 hsa_mir_200c 5.35 7.43 5.43 8.56 10.44 10.34 25.18 hsa_mir_944 2.11 10.26 2.54 3.71 6.92 10.13 9.80 hsa_mir_141 3.69 7.40 5.29 8.08 9.85 9.99 20.30 hsa_mir_196b 1.26 10.01 1.42 2.51 5.80 9.99 8.10 hsa_mir_92b 2.42 5.99 4.05 7.10 8.75 9.76 16.04 hsa_mir_584 2.21 8.52 1.82 3.82 8.00 9.74 9.27 #install if necessary install.packages("ggplot2") library(ggplot2) #one of many Brain Lower Grade Glioma specific miRNAs ggplot(aes(y=hsa_mir_219_2, x = type), data = data) + geom_boxplot()  By sorting on the variable importance values for classifying AML, we can find AML specific miRNAs: head(rn[order(rn[,1], decreasing=TRUE),],10) aml blca lgg lihc luad MeanDecreaseAccuracy MeanDecreaseGini hsa_mir_200c 5.35 7.43 5.43 8.56 10.44 10.34 25.18 hsa_mir_3934 5.22 5.26 5.12 6.08 8.39 8.22 19.33 hsa_mir_500a 5.20 4.22 3.01 4.27 3.78 5.07 8.21 hsa_mir_2355 4.67 3.55 2.36 3.85 3.31 4.56 7.35 hsa_mir_219_2 4.61 5.10 7.21 5.40 5.11 7.09 16.64 hsa_mir_1259 4.57 3.26 3.02 3.37 3.26 4.54 5.60 hsa_mir_3607 4.53 3.47 3.81 4.63 3.35 4.94 7.55 hsa_mir_1201 4.49 3.25 2.47 3.66 3.38 4.38 6.28 hsa_mir_1975 4.41 3.79 2.72 3.85 4.01 4.38 7.04 hsa_mir_3065 4.40 5.10 2.70 4.36 5.86 5.40 7.48 ggplot(aes(y=hsa_mir_1975, x = type), data = data) + geom_boxplot()  Hierarchical clustering of the samples based on miRNA expression: #install sparcl package install.packages("sparcl") #load package library(sparcl) hh <- hclust(dist(data[,1:1071])) type_as_numeric = as.numeric(data$type)
ColorDendrogram(hh, y=type_as_numeric, branchlength=230000)
legend("topright", c("Acute Myeloid Leukemia", "Bladder Urothelial Carcinoma", "Brain Lower Grade Glioma", "Liver hepatocellular carcinoma", "Lung adenocarcinoma"), col=c(2:6), pch=19)


### Conclusions

The unsupervised clustering of the different miRNA expression profiles showed the separation of the brain lower grade glioma samples from the other cancers, a main class of liver hepatocellular carcinomas and two sub-classes of acute myeloid leukemia (AML). We also observe heterogeneous expression profiles from various samples clustering with other types of cancers.

Examining the expression profile of miRNAs could potentially be a valuable diagnostic tool, especially given that miRNAs are quite stably present in blood. For example, we could build a classifier based on the panel of miRNAs and use it to classify what type of cancer a person has based on their miRNA expression pattern; this can be done using random forests. For this post, I only examined the variables (miRNAs) that were the most "important" in classifying a cancer. For example, hsa-mir-1975 was only expressed in AML and thus was an important miRNA in classifying AML from the other 4 cancers.

From the confusion matrix, it seems that the misclassification rate was extremely low. For a future update of this post (or a new post entirely), I will incorporate more cancer types and create a training and test set, for the testing of the classifier. If I have time, I will test other classifiers, as I did in an older post on predicting cancer.

Random forests

miRNAs and cancer

.
1. MO says:

Interesting post! Why don’t you normalize/pre-process the data first though? For example through edgeR or RNASeq?

1. Davo says:

Thanks for the comment. The data comes pre-normalised as parts per million/reads per million, which was what I used. I haven’t tried comparing the results with different normalisations though.

2. Paola says:

Thank you for this post! So interesting!

1. Davo says:

Glad you found it interesting! And you’re welcome 🙂

3. stata says:

Thank you! How to download miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples)? Can you tell me detail steps? Thanks.

4. Sharmi Banerjee says:

very interesting post!

This site uses Akismet to reduce spam. Learn how your comment data is processed.