Analysing miRNA expression in cancers

MiRNAs are a class of small RNAs that when expressed usually down regulates the expression of its target transcript by binding to it and causing it to degrade or inhibiting it from being translated. There has been a lot of interest in studying the expression pattern of miRNAs, especially in relation to cancer, since their mis-regulation and aberrant expression, may cause aberrant expression of other transcripts. If for example, a miRNA regulating a oncogene (a gene that has the potential to promoter cancer when over expressed) is under expressed, then the oncogene may become over active and switch on cellular processes that promoter cancer. On the other hand, if a miRNA is aberrantly over expressed and its target is a gene that protects us from the over proliferation of cells, i.e. a tumour suppressor gene, then we lose this protective ability, from which a cancer may form.

There are thousands of known miRNAs and the aberrant expression of a specific miRNA may have different consequences and in the context of cancers, it may cause cancer in a different tissue. Thus to investigate which miRNAs are mis-expressed in various cancers, I downloaded 5 miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples), Bladder Urothelial Carcinoma (208 samples), Brain Lower Grade Glioma (299 samples), Liver hepatocellular carcinoma (186 samples) and Lung adenocarcinoma (482 samples).

Continue reading


RNAfold is a program that calculates secondary structures of RNAs. RNAfold reads RNA sequences from stdin, calculates their minimum free energy (mfe) structure and prints to stdout the mfe structure in bracket notation and its free energy. My understanding is that the lowest energy structure i.e. minimum free energy, is the most thermodynamical stable one. Here I use the RNAfold web server to calculate the secondary structure of a pre-miRNA.

From miRBase, download hairpin.fa and use let-7a-1 as an example:

>hsa-let-7a-1 MI0000060 Homo sapiens let-7a-1 stem-loop

Input the sequence in the RNAfold web server:

Results for minimum free energy prediction

The optimal secondary structure in dot-bracket notation with a minimum free energy of -36.10 kcal/mol is given below.
 [color by base-pairing probability | color by positional entropy | no coloring]
1      (((((.(((..((((((((((((((((...(((.....))).((....))....))))))))))))))))..))))))))

I took the 80bp sequence and randomised it 5 times and used these random sequences as input to RNAfold to compare minimum free energies:


Minimum free energy of -13.40 kcal/mol


Minimum free energy of -18.00 kcal/mol


Minimum free energy of -15.60 kcal/mol


Minimum free energy of -21.90 kcal/mol


Minimum free energy of -21.41 kcal/mol

Results for thermodynamic ensemble prediction

The free energy of the thermodynamic ensemble is -37.53 kcal/mol.
The frequency of the MFE structure in the ensemble is 9.86 %.
The ensemble diversity is 5.89 .

Pre-miRNAs are processed by Dicer into the mature miRNAs creating two overhangs. The corresponding mature miRNAs are also available at miRBase.

>hsa-let-7a-5p MIMAT0000062 Homo sapiens let-7a-5p
>hsa-let-7a-3p MIMAT0004481 Homo sapiens let-7a-3p
>hsa-let-7a-2-3p MIMAT0010195 Homo sapiens let-7a-2-3p

Below I align the 5' and 3' mature let-7a-1 miRNAs back to the pre-miRNA, also including the secondary structure prediction and delimiting the miRNA boundaries:

1               (((((.(((..((((((((((((((((...(((.....))).((....))....))))))))))))))))..))))))))
mirna           -----UGAGGUAGUAGGUUGUAUAGUU-----------------------------CUAUACAAUCUACUGUCUUUC---
                     **********************                             *********************
                   |                                                                        |
                                          |                           |

And finally in graphical form with the 5' and 3' mature miRNAs:

Why miRNA are 22 or 23 nucleotides long

Late last year I mapped random sized DNA sequences back to the genome. The purpose was simply to see how long sequenced reads needed to be before they could be uniquely mapped to the genome. I couldn't find the statistics on this, so I just did it myself. I didn't dwell on the results too much.

One day, I heard someone say miRNA's have definitely evolved to be 22-23 nt long. As with everything in nature, there's a reason why things are the way they are. So just now, I decided to look back at my results.

The result to look at is the number of perfect matches vs. the length of the random DNA fragment. When I started to make random sized fragments larger than 20 nt long, very few of the random fragments could map uniquely. At 21 nt long, roughly 0.1% (982 out of 1_000_000) of the random DNA fragments mapped to the genome. At 22nt long, the number of mappable random DNA fragments drops to 255 out of 1_000_000 and at 23nt, 65 out of 1_000_000. 22-23 nt long DNA sequences seem like a good compromise between gaining sequence uniqueness with the shortest possible length, which is probably why they are this length. Even if there is a point mutation in a 22 nt miRNA, it will most likely still only bind to its intended target.

Using blat to map short RNAs

Updated on 2013 November 5th to include mapping of piRNAs

I still use blat as my multi-purpose alignment tool despite it being developed over 10 years ago. For those needing a simple introduction to blat, see my using blat post. Now I was wondering if blat could aligned short RNAs; the definition of short RNAs is generally <200nt. So here I try aligned miRNAs, which are in the size range of 20-24nt and piRNAs, which are in the size range of 26-31nt.

Mapping human miRNAs using blat

Firstly let's download a set of mature miRNA sequences from miRBase:

#release 20
zcat mature.fa.gz | grep ">" | wc
  30424  152120 1678688
#how many human miRNAs?
zcat mature.fa.gz | grep Homo | wc
   2578   12890  135904
#save human miRNAs
zcat mature.fa.gz | grep -A1 Homo > mature_human.fa
cat mature_human.fa | tr U T > temp
mv temp mature_human.fa

Now let's blat using the default settings to the assembled hg19 chromosomes:

#   blat database query [-ooc=11.ooc] output.psl
blat ~/genome/hg19.fa mature_human.fa mature_human.psl
#Loaded 3095693983 letters in 25 sequences
#Searched 55887 bases in 2578 sequences
#no results

The default settings don't map any of the miRNAs. Let's use a stepSize of 1, and no thresholds on the score and identity:

#using GNU parallel to speed things up:
parallel blat -stepSize=1 -minScore=0 -minIdentity=0 ~/genome/hg19_chrom/chr{}.fa mature_human.fa mature_human_{}.psl ::: {1..22} X Y M
#how many miRNAs were mapped?
cat *.psl | awk '{print $10}' | grep hsa | sort -u | wc
#   2578    2578   38305

So we could map all the miRNAs using the maximum number of overlapping tiles and no score threshold. I've written a script that converts the psl output into a bed file. In addition the script only outputs the best scoring matches:

Let's use the script to output the best scoring matches in bed format:

cat *.psl > mature_human.psl mature_human.psl mature_human_best_score.bed
#Stored 2578 entries
cat mature_human_best_score.bed | wc
  17884  214608 1392785

#how many uniquely mapped miRNAs?
cat mature_human_best_score.bed | cut -f4 | sort | uniq -c | awk '$1==1 {print}' | wc
   2147    4294   49215

#how many multimapping positions for the most highly multimapped miRNA?
cat mature_human_best_score.bed | cut -f4 | sort | uniq -c | sort -k1rn | head
    768 hsa-miR-1273f
    768 hsa-miR-5096
    768 hsa-miR-619-5p
    757 hsa-miR-1273g-3p
    722 hsa-miR-1273h-5p
    612 hsa-miR-5095
    591 hsa-miR-5585-3p
    584 hsa-miR-548ap-3p
    510 hsa-miR-548ar-5p
    507 hsa-miR-548ay-5p

Let's compare our uniquely mapped miRNAs with the genomic coordinates provided by miRBase:

##gff-version 3
##date 2013-10-1
# Chromosomal coordinates of Homo sapiens microRNAs
# microRNAs:               miRBase v20
# genome-build-id:         GRCh37.p5
# genome-build-accession:  NCBI_Assembly:GCA_000001405.6
# Hairpin precursor sequences have type "miRNA_primary_transcript".
# Note, these sequences do not represent the full primary transcript,
# rather a predicted stem-loop portion that includes the precursor
# miRNA. Mature sequences have type "miRNA".
chr1    .       miRNA_primary_transcript        17369   17436   .       -    .  ID=MI0022705;Alias=MI0022705;Name=hsa-mir-6859-1
chr1    .       miRNA   17409   17431   .       -       .       ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705
chr1    .       miRNA   17369   17391   .       -       .       ID=MIMAT0027619;Alias=MIMAT0027619;Name=hsa-miR-6859-3p;Derives_from=MI0022705
chr1    .       miRNA_primary_transcript        30366   30503   .       +    .  ID=MI0006363;Alias=MI0006363;Name=hsa-mir-1302-2
chr1    .       miRNA   30438   30458   .       +       .       ID=MIMAT0005890;Alias=MIMAT0005890;Name=hsa-miR-1302;Derives_from=MI0006363
chr1    .       miRNA_primary_transcript        567705  567793  .       -    .  ID=MI0022558;Alias=MI0022558;Name=hsa-mir-6723
chr1    .       miRNA   567762  567783  .       -       .       ID=MIMAT0025855;Alias=MIMAT0025855;Name=hsa-miR-6723-5p;Derives_from=MI0022558

And by using this script:

#!/bin/env perl

use strict;
use warnings;

my $gff = 'hsa.gff3';
my $bed = 'mature_human_best_score.bed';

chomp(my $unique = `cat mature_human_best_score.bed | cut -f4 | sort | uniq -c | awk '\$1==1 {print}'`);
#      1 hsa-miR-98-5p

my %unique = ();
my @line = split(/\n/,$unique);
foreach my $entry (@line){
   if ($entry =~ /(hsa.*)$/){
      my $mirna = $1;
      $unique{$mirna} = 1;
   } else {
      die "Could not parse $entry\n";

my %bed = ();
open(BED,'<',$bed) || die "Could not open $bed: $!\n";
   my ($chr, $start, $end, $name, @etc) = split(/\t/);
   next unless exists $unique{$name};
   $start += 1;
   $bed{$name}{'CHR'} = $chr;
   $bed{$name}{'START'} = $start;
   $bed{$name}{'END'} = $end;

open(GFF,'<',$gff) || die "Could not open $gff: $!\n";
   next if /^#/;
   #chr1    .       miRNA   17409   17431   .       -       .       ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705
   my ($chr, $source, $type, $start, $end, $score, $strand, $phase, $id) = split(/\t/);
   my $mirna = '';

   my @id_split = split(/;/,$id);
   foreach my $split (@id_split){
      if ($split =~ /Name=(.*)$/){
         $mirna = $1;
   next unless exists $bed{$mirna};

   if ($chr eq $bed{$mirna}{'CHR'} && $start == $bed{$mirna}{'START'} && $end == $bed{$mirna}{'END'}){
      print "Same\n";
   } else {
      print "Different: $mirna mapped: $bed{$mirna}{'CHR'}:$bed{$mirna}{'START'}-$bed{$mirna}{'END'} gff: $chr:$start-$end\n";


And here are the differently mapped miRNAs: | grep Diff
Different: hsa-miR-8070 mapped: chr11:11804631-11804652 gff: chr11:11804738-11804759
Different: hsa-miR-4487 mapped: chr11:47422535-47422553 gff: chr11:47422574-47422592
Different: hsa-miR-4441 mapped: chr2:240007394-240007410 gff: chr2:240007524-240007540
Different: hsa-miR-297 mapped: chr8:49860914-49860934 gff: chr4:111781780-111781800
Different: hsa-miR-3142 mapped: chr5:159901415-159901436 gff: chr5:159901457-159901478
Different: hsa-miR-450a-5p mapped: chrX:133674423-133674444 gff: chrX:133674594-133674615

So of the 2,147 uniquely mapped miRNAs using blat, 6 of them mapped differently to the coordinates provided by miRBase.

Mapping human piRNAs using blat

I will use human piRNAs generated from this post. Let's map these piRNAs in the same manner:

#using GNU parallel to speed things up:
parallel blat -stepSize=1 -minScore=0 -minIdentity=0 ~/genome/hg19_chrom/chr{}.fa human_pirna.fa human_pirna_{}.psl ::: {1..22} X Y M
#how many piRNAs were mapped?
cat human_pirna*.psl > human_pirna.psl
cat human_pirna.psl | awk '{print $10}' | grep -P "n\d+" | sort -u | wc
#  32152   32152  289368

cat human_pirna.fa | grep ">" | wc
# 32152  675404 3543949 human_pirna.psl human_pirna_best_score.bed

#how many uniquely mapped piRNAs?
cat human_pirna_best_score.bed | cut -f4 | sort | uniq -c | awk '$1==1 {print}' | wc
#  25960   51920  415360

#how many were mapped in total?
cat human_pirna_best_score.bed | cut -f4 | sort -u | wc
#  32152   32152  257216

#percentage of uniquely mapped?
bc -l<<<25960/32152*100


Using a stepSize of 1 and no thresholds on the scoring allowed the mapping of miRNAs and piRNAs using blat. The majority of uniquely mapped miRNAs from miRBase 20 were mapped to the location assigned by miRBase except for 6 exceptions. So it would seem that one could use blat for mapping short RNAs.