Analysing miRNA expression in cancers

MiRNAs are a class of small RNAs that when expressed usually down regulates the expression of its target transcript by binding to it and causing it to degrade or inhibiting it from being translated. There has been a lot of interest in studying the expression pattern of miRNAs, especially in relation to cancer, since their mis-regulation and aberrant expression, may cause aberrant expression of other transcripts. If for example, a miRNA regulating a oncogene (a gene that has the potential to promoter cancer when over expressed) is under expressed, then the oncogene may become over active and switch on cellular processes that promoter cancer. On the other hand, if a miRNA is aberrantly over expressed and its target is a gene that protects us from the over proliferation of cells, i.e. a tumour suppressor gene, then we lose this protective ability, from which a cancer may form.

There are thousands of known miRNAs and the aberrant expression of a specific miRNA may have different consequences and in the context of cancers, it may cause cancer in a different tissue. Thus to investigate which miRNAs are mis-expressed in various cancers, I downloaded 5 miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples), Bladder Urothelial Carcinoma (208 samples), Brain Lower Grade Glioma (299 samples), Liver hepatocellular carcinoma (186 samples) and Lung adenocarcinoma (482 samples).

The normalised expression (reads per million) of a panel of miRNAs is given for every sample. In this analysis, I looked at the expression of 1,072 miRNAs across 1,363 samples from 5 different classes of cancers.

After downloading the datasets, there will be a bunch of *.mirna.quantification.txt files that contain the expression of miRNAs in a particular sample. I saved each class of cancer, into 5 directories, namely aml, blca, lgg, lihc and luad. Now to load the data into R:

#set the directory to the *.mirna.quantification.txt files for aml
setwd("aml/miRNASeq/BCGSC__IlluminaGA_miRNASeq/Level_3/")
aml_file <- list.files('.', pattern="*mirna.quantification.txt")
head(aml_file)
[1] "TCGA-AB-2802-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2803-03A-01T-0734-13.mirna.quantification.txt"
[3] "TCGA-AB-2805-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2806-03A-01T-0734-13.mirna.quantification.txt"
[5] "TCGA-AB-2807-03A-01T-0734-13.mirna.quantification.txt" "TCGA-AB-2808-03A-01T-0734-13.mirna.quantification.txt"

#merging multiple files: http://novicemetrics.blogspot.jp/2011/04/merging-multiple-data-files-into-one.html
#colClasses NULL trick to skip those columns
aml_data_list <- lapply(aml_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)})
#there are warnings because the columns names are the same
#don't worry about this
aml_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, aml_data_list)
dim(aml_data)
[1] 705 189

aml_data[1,1:3]
      miRNA_ID reads_per_million_miRNA_mapped.x reads_per_million_miRNA_mapped.y
1 hsa-let-7a-1                         9239.468                         10699.25
#make miRNA_ID the row.names
row.names(aml_data) <- aml_data$miRNA_ID
#remove the first column
aml_data <- aml_data[,-1]
aml_data[1,1:3]
             reads_per_million_miRNA_mapped.x reads_per_million_miRNA_mapped.y reads_per_million_miRNA_mapped.x.1
hsa-let-7a-1                         9239.468                         10699.25                           6724.519
#rename the column names
aml_header <- aml_file
#-13.mirna.quantification.txt is redundant
aml_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', aml_header)
names(aml_data) <- aml_header
aml_data[1:3,1:3]
             TCGA-AB-2802-03A-01T-0734 TCGA-AB-2803-03A-01T-0734 TCGA-AB-2805-03A-01T-0734
hsa-let-7a-1                  9239.468                  10699.25                  6724.519
hsa-let-7a-2                 17917.031                  21746.64                 13272.049
hsa-let-7a-3                  9302.053                  10589.12                  6754.553
#705 miRNAs in 188 samples
dim(aml_data)
[1] 705 188

#now load the other 4 cancers: blca lgg lihc luad
setwd("blca/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/")
blca_file <- list.files('.', pattern="*mirna.quantification.txt")
blca_data_list <- lapply(blca_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)})
blca_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, blca_data_list)
row.names(blca_data) <- blca_data$miRNA_ID
blca_data <- blca_data[,-1]
blca_header <- blca_file
blca_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', blca_header)
names(blca_data) <- blca_header
dim(blca_data)
[1] 1046  208

setwd("lihc/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/")
lihc_file <- list.files('.', pattern="*mirna.quantification.txt")
lihc_data_list <- lapply(lihc_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)})
lihc_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, lihc_data_list)
row.names(lihc_data) <- lihc_data$miRNA_ID
lihc_data <- lihc_data[,-1]
lihc_header <- lihc_file
lihc_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', lihc_header)
names(lihc_data) <- lihc_header
dim(lihc_data)
[1] 1046  186

setwd("lgg/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/")
lgg_file <- list.files('.', pattern="*mirna.quantification.txt")
lgg_data_list <- lapply(lgg_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)})
lgg_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, lgg_data_list)
row.names(lgg_data) <- lgg_data$miRNA_ID
lgg_data <- lgg_data[,-1]
lgg_header <- lgg_file
lgg_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', lgg_header)
names(lgg_data) <- lgg_header
dim(lgg_data)
[1] 1046  299

setwd("luad/miRNASeq/BCGSC__IlluminaHiSeq_miRNASeq/Level_3/")
luad_file <- list.files('.', pattern="*mirna.quantification.txt")
luad_data_list <- lapply(luad_file, function(x){ read.table(x, colClasses=c('character','NULL','numeric','NULL'), header=T)})
luad_data <- Reduce(function(x,y) {merge(x,y,by='miRNA_ID')}, luad_data_list)
row.names(luad_data) <- luad_data$miRNA_ID
luad_data <- luad_data[,-1]
luad_header <- luad_file
luad_header <- gsub(pattern="-13.mirna.quantification.txt", replacement='', luad_header)
names(luad_data) <- luad_header
dim(luad_data)
[1] 1046  482

Now that we have 5 individual data frames for each class of cancer, let's merge them together into one big table:

#merge all datasets 1 by 1
#aml blca lgg lihc luad
data <- merge(aml_data, blca_data, by="row.names", all=T)
#after each merge a new column called Row.names is created
#store Row.names as the row.names and remove the column
row.names(data) <- data$Row.names
data <- data[,-1]

#lgg
data <- merge(data, lgg_data, by="row.names", all=T)
row.names(data) <- data$Row.names
data <- data[,-1]

#lihc
data <- merge(data, lihc_data, by="row.names", all=T)
row.names(data) <- data$Row.names
data <- data[,-1]

#luad
data <- merge(data, luad_data, by="row.names", all=T)
row.names(data) <- data$Row.names
data <- data[,-1]

dim(data)
[1] 1071 1363

#sanity check
#aml blca lgg lihc luad
188 + 208 + 299 + 186 + 482
[1] 1363

So we have merged the 5 datasets into one big data frame called data, which contains 1071 rows (miRNAs) and 1363 columns (samples). I will use random forests to find which variables (miRNAs) are important in classifying the cancers into their respective classes.

#transpose dataset
data <- t(data)
data <- as.data.frame(data)
#add a type column to data
#and label rows to their cancer type
data$type <- c(rep('aml',188), rep('blca',208), rep('lgg',299), rep('lihc',186), rep('luad',482))
#make into factor
data$type <- as.factor(data$type)
#make all NAs into zero
data[is.na(data)] <- 0
#replace hyphens
names(data) <- gsub(pattern='-', replacement='_', x=names(data))

#check the data frame
data[c(1,189,397,696,1363),1070:1072]
                          hsa_mir_99a hsa_mir_99b type
TCGA-AB-2802-03A-01T-0734   209.52409     1993.20  aml
TCGA-BL-A0C8-01A-11R-A10V    33.83794    36105.35 blca
TCGA-CS-4938-01B-11R-1895  8569.32665    56089.27  lgg
TCGA-BC-4072-01B-11R-A154   644.22746    21531.59 lihc
TCGA-O1-A52J-01A-11H-A263   498.27723    22333.08 luad

#install the randomForest package if necessary
install.packages("randomForest")
#load library
library(randomForest)
r <- randomForest(type ~ ., data=data, importance=TRUE, do.trace=100)
r

ntree      OOB      1      2      3      4      5
  100:   1.39%  0.00%  3.37%  0.33%  1.61%  1.66%
  200:   1.39%  0.00%  3.85%  0.33%  1.08%  1.66%
  300:   1.17%  0.00%  2.40%  0.33%  1.08%  1.66%
  400:   1.03%  0.00%  2.40%  0.33%  1.08%  1.24%
  500:   0.88%  0.00%  2.40%  0.33%  1.08%  0.83%

r

Call:
 randomForest(formula = type ~ ., data = data, importance = TRUE,      do.trace = 100) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 32

        OOB estimate of  error rate: 0.88%
Confusion matrix:
     aml blca lgg lihc luad class.error
aml  188    0   0    0    0 0.000000000
blca   0  203   1    0    4 0.024038462
lgg    0    0 298    1    0 0.003344482
lihc   0    1   0  184    1 0.010752688
luad   0    4   0    0  478 0.008298755

#install if necessary
install.packages("pROC")
library(pROC)
roc(r$y, r$votes)

Call:
roc.default(response = r$y, predictor = r$votes)

Data: r$votes in 940 controls (r$y aml) < 1040 cases (r$y blca).
Area under the curve: 0.6578

rn <- round(importance(r), 2)
#order by MeanDecreaseAccuracy
head(rn[order(rn[,6], decreasing=TRUE),],10)
              aml  blca  lgg lihc  luad MeanDecreaseAccuracy MeanDecreaseGini
hsa_mir_375  3.88 12.00 4.81 3.37 13.52                13.49            29.13
hsa_mir_205  2.46 11.01 3.49 6.33  7.69                11.53            15.65
hsa_mir_10b  1.98  9.76 0.88 1.68  8.87                11.39            11.02
hsa_mir_122  2.53  7.76 3.85 9.92  8.79                10.85            16.73
hsa_mir_200c 5.35  7.43 5.43 8.56 10.44                10.34            25.18
hsa_mir_944  2.11 10.26 2.54 3.71  6.92                10.13             9.80
hsa_mir_141  3.69  7.40 5.29 8.08  9.85                 9.99            20.30
hsa_mir_196b 1.26 10.01 1.42 2.51  5.80                 9.99             8.10
hsa_mir_92b  2.42  5.99 4.05 7.10  8.75                 9.76            16.04
hsa_mir_584  2.21  8.52 1.82 3.82  8.00                 9.74             9.27

#install if necessary
install.packages("ggplot2")
library(ggplot2)
#one of many Brain Lower Grade Glioma specific miRNAs
ggplot(aes(y=hsa_mir_219_2, x = type), data = data) + geom_boxplot()

lgg_hsa_mir_219_2

By sorting on the variable importance values for classifying AML, we can find AML specific miRNAs:

head(rn[order(rn[,1], decreasing=TRUE),],10)
               aml blca  lgg lihc  luad MeanDecreaseAccuracy MeanDecreaseGini
hsa_mir_200c  5.35 7.43 5.43 8.56 10.44                10.34            25.18
hsa_mir_3934  5.22 5.26 5.12 6.08  8.39                 8.22            19.33
hsa_mir_500a  5.20 4.22 3.01 4.27  3.78                 5.07             8.21
hsa_mir_2355  4.67 3.55 2.36 3.85  3.31                 4.56             7.35
hsa_mir_219_2 4.61 5.10 7.21 5.40  5.11                 7.09            16.64
hsa_mir_1259  4.57 3.26 3.02 3.37  3.26                 4.54             5.60
hsa_mir_3607  4.53 3.47 3.81 4.63  3.35                 4.94             7.55
hsa_mir_1201  4.49 3.25 2.47 3.66  3.38                 4.38             6.28
hsa_mir_1975  4.41 3.79 2.72 3.85  4.01                 4.38             7.04
hsa_mir_3065  4.40 5.10 2.70 4.36  5.86                 5.40             7.48
ggplot(aes(y=hsa_mir_1975, x = type), data = data) + geom_boxplot()

aml_hsa_mir_1975

Hierarchical clustering of the samples based on miRNA expression:

#install sparcl package
install.packages("sparcl")
#load package
library(sparcl)

hh <- hclust(dist(data[,1:1071]))

type_as_numeric = as.numeric(data$type)
ColorDendrogram(hh, y=type_as_numeric, branchlength=230000)
legend("topright", c("Acute Myeloid Leukemia", "Bladder Urothelial Carcinoma", "Brain Lower Grade Glioma", "Liver hepatocellular carcinoma", "Lung adenocarcinoma"), col=c(2:6), pch=19)

hc_mirna_all

Conclusions

The unsupervised clustering of the different miRNA expression profiles showed the separation of the brain lower grade glioma samples from the other cancers, a main class of liver hepatocellular carcinomas and two sub-classes of acute myeloid leukemia (AML). We also observe heterogeneous expression profiles from various samples clustering with other types of cancers.

Examining the expression profile of miRNAs could potentially be a valuable diagnostic tool, especially given that miRNAs are quite stably present in blood. For example, we could build a classifier based on the panel of miRNAs and use it to classify what type of cancer a person has based on their miRNA expression pattern; this can be done using random forests. For this post, I only examined the variables (miRNAs) that were the most "important" in classifying a cancer. For example, hsa-mir-1975 was only expressed in AML and thus was an important miRNA in classifying AML from the other 4 cancers.

From the confusion matrix, it seems that the misclassification rate was extremely low. For a future update of this post (or a new post entirely), I will incorporate more cancer types and create a training and test set, for the testing of the classifier. If I have time, I will test other classifiers, as I did in an older post on predicting cancer.

Future reading

Random forests

miRNAs and cancer




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
7 comments Add yours
  1. Interesting post! Why don’t you normalize/pre-process the data first though? For example through edgeR or RNASeq?

    1. Thanks for the comment. The data comes pre-normalised as parts per million/reads per million, which was what I used. I haven’t tried comparing the results with different normalisations though.

  2. Thank you! How to download miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples)? Can you tell me detail steps? Thanks.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.