MiRNAs are a class of small RNAs that when expressed usually down regulates the expression of its target transcript by binding to it and causing it to degrade or inhibiting it from being translated. There has been a lot of interest in studying the expression pattern of miRNAs, especially in relation to cancer, since their mis-regulation and aberrant expression, may cause aberrant expression of other transcripts. If for example, a miRNA regulating a oncogene (a gene that has the potential to promoter cancer when over expressed) is under expressed, then the oncogene may become over active and switch on cellular processes that promoter cancer. On the other hand, if a miRNA is aberrantly over expressed and its target is a gene that protects us from the over proliferation of cells, i.e. a tumour suppressor gene, then we lose this protective ability, from which a cancer may form.
There are thousands of known miRNAs and the aberrant expression of a specific miRNA may have different consequences and in the context of cancers, it may cause cancer in a different tissue. Thus to investigate which miRNAs are mis-expressed in various cancers, I downloaded 5 miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples), Bladder Urothelial Carcinoma (208 samples), Brain Lower Grade Glioma (299 samples), Liver hepatocellular carcinoma (186 samples) and Lung adenocarcinoma (482 samples).
The normalised expression (reads per million) of a panel of miRNAs is given for every sample. In this analysis, I looked at the expression of 1,072 miRNAs across 1,363 samples from 5 different classes of cancers.
After downloading the datasets, there will be a bunch of *.mirna.quantification.txt files that contain the expression of miRNAs in a particular sample. I saved each class of cancer, into 5 directories, namely aml, blca, lgg, lihc and luad. Now to load the data into R:
#set the directory to the *.mirna.quantification.txt files for aml setwd("aml/miRNASeq/BCGSC__IlluminaGA_miRNASeq/Level_3/") aml_file <- list.files('.', pattern="*mirna.quantification.txt") head(aml_file) [1] "TCGA-AB-2802-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2803-03A-01T-0734-13.mirna.quantification.txt" [3] "TCGA-AB-2805-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2806-03A-01T-0734-13.mirna.quantification.txt" [5] "TCGA-AB-2807-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2808-03A-01T-0734-13.mirna.quantification.txt" #merging multiple files: http://novicemetrics.blogspot.jp/2011/04/merging-multiple-data-files-into-one.html #colClasses NULL trick to skip those columns aml_data_list <- lapply(aml_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)}) #there are warnings because the columns names are the same #don't worry about this aml_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, aml_data_list) dim(aml_data) [1] 705 189 aml_data[1,1:3] miRNA_ID reads_per_million_miRNA_mapped.x reads_per_million_miRNA_mapped.y 1 hsa-let-7a-1 9239.468 10699.25 #make miRNA_ID the row.names row.names(aml_data) <- aml_data$miRNA_ID #remove the first column aml_data <- aml_data[,-1] aml_data[1,1:3] reads_per_million_miRNA_mapped.x reads_per_million_miRNA_mapped.y reads_per_million_miRNA_mapped.x.1 hsa-let-7a-1 9239.468 10699.25 6724.519 #rename the column names aml_header <- aml_file #-13.mirna.quantification.txt is redundant aml_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', aml_header) names(aml_data) <- aml_header aml_data[1:3,1:3] TCGA-AB-2802-03A-01T-0734 TCGA-AB-2803-03A-01T-0734 TCGA-AB-2805-03A-01T-0734 hsa-let-7a-1 9239.468 10699.25 6724.519 hsa-let-7a-2 17917.031 21746.64 13272.049 hsa-let-7a-3 9302.053 10589.12 6754.553 #705 miRNAs in 188 samples dim(aml_data) [1] 705 188 #now load the other 4 cancers: blca lgg lihc luad setwd("blca/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/") blca_file <- list.files('.', pattern="*mirna.quantification.txt") blca_data_list <- lapply(blca_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)}) blca_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, blca_data_list) row.names(blca_data) <- blca_data$miRNA_ID blca_data <- blca_data[,-1] blca_header <- blca_file blca_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', blca_header) names(blca_data) <- blca_header dim(blca_data) [1] 1046 208 setwd("lihc/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/") lihc_file <- list.files('.', pattern="*mirna.quantification.txt") lihc_data_list <- lapply(lihc_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)}) lihc_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, lihc_data_list) row.names(lihc_data) <- lihc_data$miRNA_ID lihc_data <- lihc_data[,-1] lihc_header <- lihc_file lihc_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', lihc_header) names(lihc_data) <- lihc_header dim(lihc_data) [1] 1046 186 setwd("lgg/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/") lgg_file <- list.files('.', pattern="*mirna.quantification.txt") lgg_data_list <- lapply(lgg_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)}) lgg_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, lgg_data_list) row.names(lgg_data) <- lgg_data$miRNA_ID lgg_data <- lgg_data[,-1] lgg_header <- lgg_file lgg_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', lgg_header) names(lgg_data) <- lgg_header dim(lgg_data) [1] 1046 299 setwd("luad/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/") luad_file <- list.files('.', pattern="*mirna.quantification.txt") luad_data_list <- lapply(luad_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)}) luad_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, luad_data_list) row.names(luad_data) <- luad_data$miRNA_ID luad_data <- luad_data[,-1] luad_header <- luad_file luad_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', luad_header) names(luad_data) <- luad_header dim(luad_data) [1] 1046 482
Now that we have 5 individual data frames for each class of cancer, let's merge them together into one big table:
#merge all datasets 1 by 1 #aml blca lgg lihc luad data <- merge(aml_data, blca_data, by="row.names", all=T) #after each merge a new column called Row.names is created #store Row.names as the row.names and remove the column row.names(data) <- data$Row.names data <- data[,-1] #lgg data <- merge(data, lgg_data, by="row.names", all=T) row.names(data) <- data$Row.names data <- data[,-1] #lihc data <- merge(data, lihc_data, by="row.names", all=T) row.names(data) <- data$Row.names data <- data[,-1] #luad data <- merge(data, luad_data, by="row.names", all=T) row.names(data) <- data$Row.names data <- data[,-1] dim(data) [1] 1071 1363 #sanity check #aml blca lgg lihc luad 188 + 208 + 299 + 186 + 482 [1] 1363
So we have merged the 5 datasets into one big data frame called data, which contains 1071 rows (miRNAs) and 1363 columns (samples). I will use random forests to find which variables (miRNAs) are important in classifying the cancers into their respective classes.
#transpose dataset data <- t(data) data <- as.data.frame(data) #add a type column to data #and label rows to their cancer type data$type <- c(rep('aml',188), rep('blca',208), rep('lgg',299), rep('lihc',186), rep('luad',482)) #make into factor data$type <- as.factor(data$type) #make all NAs into zero data[is.na(data)] <- 0 #replace hyphens names(data) <- gsub(pattern='-', replacement='_', x=names(data)) #check the data frame data[c(1,189,397,696,1363),1070:1072] hsa_mir_99a hsa_mir_99b type TCGA-AB-2802-03A-01T-0734 209.52409 1993.20 aml TCGA-BL-A0C8-01A-11R-A10V 33.83794 36105.35 blca TCGA-CS-4938-01B-11R-1895 8569.32665 56089.27 lgg TCGA-BC-4072-01B-11R-A154 644.22746 21531.59 lihc TCGA-O1-A52J-01A-11H-A263 498.27723 22333.08 luad #install the randomForest package if necessary install.packages("randomForest") #load library library(randomForest) r <- randomForest(type ~ ., data=data, importance=TRUE, do.trace=100) r ntree OOB 1 2 3 4 5 100: 1.39% 0.00% 3.37% 0.33% 1.61% 1.66% 200: 1.39% 0.00% 3.85% 0.33% 1.08% 1.66% 300: 1.17% 0.00% 2.40% 0.33% 1.08% 1.66% 400: 1.03% 0.00% 2.40% 0.33% 1.08% 1.24% 500: 0.88% 0.00% 2.40% 0.33% 1.08% 0.83% r Call: randomForest(formula = type ~ ., data = data, importance = TRUE, do.trace = 100) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 32 OOB estimate of error rate: 0.88% Confusion matrix: aml blca lgg lihc luad class.error aml 188 0 0 0 0 0.000000000 blca 0 203 1 0 4 0.024038462 lgg 0 0 298 1 0 0.003344482 lihc 0 1 0 184 1 0.010752688 luad 0 4 0 0 478 0.008298755 #install if necessary install.packages("pROC") library(pROC) roc(r$y, r$votes) Call: roc.default(response = r$y, predictor = r$votes) Data: r$votes in 940 controls (r$y aml) < 1040 cases (r$y blca). Area under the curve: 0.6578 rn <- round(importance(r), 2) #order by MeanDecreaseAccuracy head(rn[order(rn[,6], decreasing=TRUE),],10) aml blca lgg lihc luad MeanDecreaseAccuracy MeanDecreaseGini hsa_mir_375 3.88 12.00 4.81 3.37 13.52 13.49 29.13 hsa_mir_205 2.46 11.01 3.49 6.33 7.69 11.53 15.65 hsa_mir_10b 1.98 9.76 0.88 1.68 8.87 11.39 11.02 hsa_mir_122 2.53 7.76 3.85 9.92 8.79 10.85 16.73 hsa_mir_200c 5.35 7.43 5.43 8.56 10.44 10.34 25.18 hsa_mir_944 2.11 10.26 2.54 3.71 6.92 10.13 9.80 hsa_mir_141 3.69 7.40 5.29 8.08 9.85 9.99 20.30 hsa_mir_196b 1.26 10.01 1.42 2.51 5.80 9.99 8.10 hsa_mir_92b 2.42 5.99 4.05 7.10 8.75 9.76 16.04 hsa_mir_584 2.21 8.52 1.82 3.82 8.00 9.74 9.27 #install if necessary install.packages("ggplot2") library(ggplot2) #one of many Brain Lower Grade Glioma specific miRNAs ggplot(aes(y=hsa_mir_219_2, x = type), data = data) + geom_boxplot()
By sorting on the variable importance values for classifying AML, we can find AML specific miRNAs:
head(rn[order(rn[,1], decreasing=TRUE),],10) aml blca lgg lihc luad MeanDecreaseAccuracy MeanDecreaseGini hsa_mir_200c 5.35 7.43 5.43 8.56 10.44 10.34 25.18 hsa_mir_3934 5.22 5.26 5.12 6.08 8.39 8.22 19.33 hsa_mir_500a 5.20 4.22 3.01 4.27 3.78 5.07 8.21 hsa_mir_2355 4.67 3.55 2.36 3.85 3.31 4.56 7.35 hsa_mir_219_2 4.61 5.10 7.21 5.40 5.11 7.09 16.64 hsa_mir_1259 4.57 3.26 3.02 3.37 3.26 4.54 5.60 hsa_mir_3607 4.53 3.47 3.81 4.63 3.35 4.94 7.55 hsa_mir_1201 4.49 3.25 2.47 3.66 3.38 4.38 6.28 hsa_mir_1975 4.41 3.79 2.72 3.85 4.01 4.38 7.04 hsa_mir_3065 4.40 5.10 2.70 4.36 5.86 5.40 7.48 ggplot(aes(y=hsa_mir_1975, x = type), data = data) + geom_boxplot()
Hierarchical clustering of the samples based on miRNA expression:
#install sparcl package install.packages("sparcl") #load package library(sparcl) hh <- hclust(dist(data[,1:1071])) type_as_numeric = as.numeric(data$type) ColorDendrogram(hh, y=type_as_numeric, branchlength=230000) legend("topright", c("Acute Myeloid Leukemia", "Bladder Urothelial Carcinoma", "Brain Lower Grade Glioma", "Liver hepatocellular carcinoma", "Lung adenocarcinoma"), col=c(2:6), pch=19)
Conclusions
The unsupervised clustering of the different miRNA expression profiles showed the separation of the brain lower grade glioma samples from the other cancers, a main class of liver hepatocellular carcinomas and two sub-classes of acute myeloid leukemia (AML). We also observe heterogeneous expression profiles from various samples clustering with other types of cancers.
Examining the expression profile of miRNAs could potentially be a valuable diagnostic tool, especially given that miRNAs are quite stably present in blood. For example, we could build a classifier based on the panel of miRNAs and use it to classify what type of cancer a person has based on their miRNA expression pattern; this can be done using random forests. For this post, I only examined the variables (miRNAs) that were the most "important" in classifying a cancer. For example, hsa-mir-1975 was only expressed in AML and thus was an important miRNA in classifying AML from the other 4 cancers.
From the confusion matrix, it seems that the misclassification rate was extremely low. For a future update of this post (or a new post entirely), I will incorporate more cancer types and create a training and test set, for the testing of the classifier. If I have time, I will test other classifiers, as I did in an older post on predicting cancer.
Future reading

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Interesting post! Why don’t you normalize/pre-process the data first though? For example through edgeR or RNASeq?
Thanks for the comment. The data comes pre-normalised as parts per million/reads per million, which was what I used. I haven’t tried comparing the results with different normalisations though.
Thank you for this post! So interesting!
Glad you found it interesting! And you’re welcome 🙂
Thank you! How to download miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples)? Can you tell me detail steps? Thanks.
Hi,
detailed information can be found at https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp.
Cheers,
Dave
very interesting post!