MiRNAs are a class of small RNAs that when expressed usually down regulates the expression of its target transcript by binding to it and causing it to degrade or inhibiting it from being translated. There has been a lot of interest in studying the expression pattern of miRNAs, especially in relation to cancer, since their mis-regulation and aberrant expression, may cause aberrant expression of other transcripts. If for example, a miRNA regulating a oncogene (a gene that has the potential to promoter cancer when over expressed) is under expressed, then the oncogene may become over active and switch on cellular processes that promoter cancer. On the other hand, if a miRNA is aberrantly over expressed and its target is a gene that protects us from the over proliferation of cells, i.e. a tumour suppressor gene, then we lose this protective ability, from which a cancer may form.
There are thousands of known miRNAs and the aberrant expression of a specific miRNA may have different consequences and in the context of cancers, it may cause cancer in a different tissue. Thus to investigate which miRNAs are mis-expressed in various cancers, I downloaded 5 miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples), Bladder Urothelial Carcinoma (208 samples), Brain Lower Grade Glioma (299 samples), Liver hepatocellular carcinoma (186 samples) and Lung adenocarcinoma (482 samples).
The normalised expression (reads per million) of a panel of miRNAs is given for every sample. In this analysis, I looked at the expression of 1,072 miRNAs across 1,363 samples from 5 different classes of cancers.
After downloading the datasets, there will be a bunch of *.mirna.quantification.txt files that contain the expression of miRNAs in a particular sample. I saved each class of cancer, into 5 directories, namely aml, blca, lgg, lihc and luad. Now to load the data into R:
#set the directory to the *.mirna.quantification.txt files for aml
setwd("aml/miRNASeq/BCGSC__IlluminaGA_miRNASeq/Level_3/")
aml_file <- list.files('.', pattern="*mirna.quantification.txt")
head(aml_file)
[1] "TCGA-AB-2802-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2803-03A-01T-0734-13.mirna.quantification.txt"
[3] "TCGA-AB-2805-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2806-03A-01T-0734-13.mirna.quantification.txt"
[5] "TCGA-AB-2807-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2808-03A-01T-0734-13.mirna.quantification.txt"
#merging multiple files: http://novicemetrics.blogspot.jp/2011/04/merging-multiple-data-files-into-one.html
#colClasses NULL trick to skip those columns
aml_data_list <- lapply(aml_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)})
#there are warnings because the columns names are the same
#don't worry about this
aml_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, aml_data_list)
dim(aml_data)
[1] 705 189
aml_data[1,1:3]
miRNA_ID reads_per_million_miRNA_mapped.x reads_per_million_miRNA_mapped.y
1 hsa-let-7a-1 9239.468 10699.25
#make miRNA_ID the row.names
row.names(aml_data) <- aml_data$miRNA_ID
#remove the first column
aml_data <- aml_data[,-1]
aml_data[1,1:3]
reads_per_million_miRNA_mapped.x reads_per_million_miRNA_mapped.y reads_per_million_miRNA_mapped.x.1
hsa-let-7a-1 9239.468 10699.25 6724.519
#rename the column names
aml_header <- aml_file
#-13.mirna.quantification.txt is redundant
aml_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', aml_header)
names(aml_data) <- aml_header
aml_data[1:3,1:3]
TCGA-AB-2802-03A-01T-0734 TCGA-AB-2803-03A-01T-0734 TCGA-AB-2805-03A-01T-0734
hsa-let-7a-1 9239.468 10699.25 6724.519
hsa-let-7a-2 17917.031 21746.64 13272.049
hsa-let-7a-3 9302.053 10589.12 6754.553
#705 miRNAs in 188 samples
dim(aml_data)
[1] 705 188
#now load the other 4 cancers: blca lgg lihc luad
setwd("blca/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/")
blca_file <- list.files('.', pattern="*mirna.quantification.txt")
blca_data_list <- lapply(blca_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)})
blca_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, blca_data_list)
row.names(blca_data) <- blca_data$miRNA_ID
blca_data <- blca_data[,-1]
blca_header <- blca_file
blca_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', blca_header)
names(blca_data) <- blca_header
dim(blca_data)
[1] 1046 208
setwd("lihc/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/")
lihc_file <- list.files('.', pattern="*mirna.quantification.txt")
lihc_data_list <- lapply(lihc_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)})
lihc_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, lihc_data_list)
row.names(lihc_data) <- lihc_data$miRNA_ID
lihc_data <- lihc_data[,-1]
lihc_header <- lihc_file
lihc_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', lihc_header)
names(lihc_data) <- lihc_header
dim(lihc_data)
[1] 1046 186
setwd("lgg/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/")
lgg_file <- list.files('.', pattern="*mirna.quantification.txt")
lgg_data_list <- lapply(lgg_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)})
lgg_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, lgg_data_list)
row.names(lgg_data) <- lgg_data$miRNA_ID
lgg_data <- lgg_data[,-1]
lgg_header <- lgg_file
lgg_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', lgg_header)
names(lgg_data) <- lgg_header
dim(lgg_data)
[1] 1046 299
setwd("luad/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/")
luad_file <- list.files('.', pattern="*mirna.quantification.txt")
luad_data_list <- lapply(luad_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)})
luad_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, luad_data_list)
row.names(luad_data) <- luad_data$miRNA_ID
luad_data <- luad_data[,-1]
luad_header <- luad_file
luad_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', luad_header)
names(luad_data) <- luad_header
dim(luad_data)
[1] 1046 482
Now that we have 5 individual data frames for each class of cancer, let's merge them together into one big table:
#merge all datasets 1 by 1 #aml blca lgg lihc luad data <- merge(aml_data, blca_data, by="row.names", all=T) #after each merge a new column called Row.names is created #store Row.names as the row.names and remove the column row.names(data) <- data$Row.names data <- data[,-1] #lgg data <- merge(data, lgg_data, by="row.names", all=T) row.names(data) <- data$Row.names data <- data[,-1] #lihc data <- merge(data, lihc_data, by="row.names", all=T) row.names(data) <- data$Row.names data <- data[,-1] #luad data <- merge(data, luad_data, by="row.names", all=T) row.names(data) <- data$Row.names data <- data[,-1] dim(data) [1] 1071 1363 #sanity check #aml blca lgg lihc luad 188 + 208 + 299 + 186 + 482 [1] 1363
So we have merged the 5 datasets into one big data frame called data, which contains 1071 rows (miRNAs) and 1363 columns (samples). I will use random forests to find which variables (miRNAs) are important in classifying the cancers into their respective classes.
#transpose dataset
data <- t(data)
data <- as.data.frame(data)
#add a type column to data
#and label rows to their cancer type
data$type <- c(rep('aml',188), rep('blca',208), rep('lgg',299), rep('lihc',186), rep('luad',482))
#make into factor
data$type <- as.factor(data$type)
#make all NAs into zero
data[is.na(data)] <- 0
#replace hyphens
names(data) <- gsub(pattern='-', replacement='_', x=names(data))
#check the data frame
data[c(1,189,397,696,1363),1070:1072]
hsa_mir_99a hsa_mir_99b type
TCGA-AB-2802-03A-01T-0734 209.52409 1993.20 aml
TCGA-BL-A0C8-01A-11R-A10V 33.83794 36105.35 blca
TCGA-CS-4938-01B-11R-1895 8569.32665 56089.27 lgg
TCGA-BC-4072-01B-11R-A154 644.22746 21531.59 lihc
TCGA-O1-A52J-01A-11H-A263 498.27723 22333.08 luad
#install the randomForest package if necessary
install.packages("randomForest")
#load library
library(randomForest)
r <- randomForest(type ~ ., data=data, importance=TRUE, do.trace=100)
r
ntree OOB 1 2 3 4 5
100: 1.39% 0.00% 3.37% 0.33% 1.61% 1.66%
200: 1.39% 0.00% 3.85% 0.33% 1.08% 1.66%
300: 1.17% 0.00% 2.40% 0.33% 1.08% 1.66%
400: 1.03% 0.00% 2.40% 0.33% 1.08% 1.24%
500: 0.88% 0.00% 2.40% 0.33% 1.08% 0.83%
r
Call:
randomForest(formula = type ~ ., data = data, importance = TRUE, do.trace = 100)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 32
OOB estimate of error rate: 0.88%
Confusion matrix:
aml blca lgg lihc luad class.error
aml 188 0 0 0 0 0.000000000
blca 0 203 1 0 4 0.024038462
lgg 0 0 298 1 0 0.003344482
lihc 0 1 0 184 1 0.010752688
luad 0 4 0 0 478 0.008298755
#install if necessary
install.packages("pROC")
library(pROC)
roc(r$y, r$votes)
Call:
roc.default(response = r$y, predictor = r$votes)
Data: r$votes in 940 controls (r$y aml) < 1040 cases (r$y blca).
Area under the curve: 0.6578
rn <- round(importance(r), 2)
#order by MeanDecreaseAccuracy
head(rn[order(rn[,6], decreasing=TRUE),],10)
aml blca lgg lihc luad MeanDecreaseAccuracy MeanDecreaseGini
hsa_mir_375 3.88 12.00 4.81 3.37 13.52 13.49 29.13
hsa_mir_205 2.46 11.01 3.49 6.33 7.69 11.53 15.65
hsa_mir_10b 1.98 9.76 0.88 1.68 8.87 11.39 11.02
hsa_mir_122 2.53 7.76 3.85 9.92 8.79 10.85 16.73
hsa_mir_200c 5.35 7.43 5.43 8.56 10.44 10.34 25.18
hsa_mir_944 2.11 10.26 2.54 3.71 6.92 10.13 9.80
hsa_mir_141 3.69 7.40 5.29 8.08 9.85 9.99 20.30
hsa_mir_196b 1.26 10.01 1.42 2.51 5.80 9.99 8.10
hsa_mir_92b 2.42 5.99 4.05 7.10 8.75 9.76 16.04
hsa_mir_584 2.21 8.52 1.82 3.82 8.00 9.74 9.27
#install if necessary
install.packages("ggplot2")
library(ggplot2)
#one of many Brain Lower Grade Glioma specific miRNAs
ggplot(aes(y=hsa_mir_219_2, x = type), data = data) + geom_boxplot()
By sorting on the variable importance values for classifying AML, we can find AML specific miRNAs:
head(rn[order(rn[,1], decreasing=TRUE),],10)
aml blca lgg lihc luad MeanDecreaseAccuracy MeanDecreaseGini
hsa_mir_200c 5.35 7.43 5.43 8.56 10.44 10.34 25.18
hsa_mir_3934 5.22 5.26 5.12 6.08 8.39 8.22 19.33
hsa_mir_500a 5.20 4.22 3.01 4.27 3.78 5.07 8.21
hsa_mir_2355 4.67 3.55 2.36 3.85 3.31 4.56 7.35
hsa_mir_219_2 4.61 5.10 7.21 5.40 5.11 7.09 16.64
hsa_mir_1259 4.57 3.26 3.02 3.37 3.26 4.54 5.60
hsa_mir_3607 4.53 3.47 3.81 4.63 3.35 4.94 7.55
hsa_mir_1201 4.49 3.25 2.47 3.66 3.38 4.38 6.28
hsa_mir_1975 4.41 3.79 2.72 3.85 4.01 4.38 7.04
hsa_mir_3065 4.40 5.10 2.70 4.36 5.86 5.40 7.48
ggplot(aes(y=hsa_mir_1975, x = type), data = data) + geom_boxplot()
Hierarchical clustering of the samples based on miRNA expression:
#install sparcl package
install.packages("sparcl")
#load package
library(sparcl)
hh <- hclust(dist(data[,1:1071]))
type_as_numeric = as.numeric(data$type)
ColorDendrogram(hh, y=type_as_numeric, branchlength=230000)
legend("topright", c("Acute Myeloid Leukemia", "Bladder Urothelial Carcinoma", "Brain Lower Grade Glioma", "Liver hepatocellular carcinoma", "Lung adenocarcinoma"), col=c(2:6), pch=19)
Conclusions
The unsupervised clustering of the different miRNA expression profiles showed the separation of the brain lower grade glioma samples from the other cancers, a main class of liver hepatocellular carcinomas and two sub-classes of acute myeloid leukemia (AML). We also observe heterogeneous expression profiles from various samples clustering with other types of cancers.
Examining the expression profile of miRNAs could potentially be a valuable diagnostic tool, especially given that miRNAs are quite stably present in blood. For example, we could build a classifier based on the panel of miRNAs and use it to classify what type of cancer a person has based on their miRNA expression pattern; this can be done using random forests. For this post, I only examined the variables (miRNAs) that were the most "important" in classifying a cancer. For example, hsa-mir-1975 was only expressed in AML and thus was an important miRNA in classifying AML from the other 4 cancers.
From the confusion matrix, it seems that the misclassification rate was extremely low. For a future update of this post (or a new post entirely), I will incorporate more cancer types and create a training and test set, for the testing of the classifier. If I have time, I will test other classifiers, as I did in an older post on predicting cancer.
Future reading

This work is licensed under a Creative Commons
Attribution 4.0 International License.



Interesting post! Why don’t you normalize/pre-process the data first though? For example through edgeR or RNASeq?
Thanks for the comment. The data comes pre-normalised as parts per million/reads per million, which was what I used. I haven’t tried comparing the results with different normalisations though.
Thank you for this post! So interesting!
Glad you found it interesting! And you’re welcome 🙂
Thank you! How to download miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples)? Can you tell me detail steps? Thanks.
Hi,
detailed information can be found at https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp.
Cheers,
Dave
very interesting post!