RNAfold

RNAfold is a program that calculates secondary structures of RNAs. RNAfold reads RNA sequences from stdin, calculates their minimum free energy (mfe) structure and prints to stdout the mfe structure in bracket notation and its free energy. My understanding is that the lowest energy structure i.e. minimum free energy, is the most thermodynamical stable one. Here I use the RNAfold web server to calculate the secondary structure of a pre-miRNA.

From miRBase, download hairpin.fa and use let-7a-1 as an example:

>hsa-let-7a-1 MI0000060 Homo sapiens let-7a-1 stem-loop
UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACC
CACCACUGGGAGAUAACUAUACAAUCUACUGUCUUUCCUA

Input the sequence in the RNAfold web server:

Results for minimum free energy prediction

The optimal secondary structure in dot-bracket notation with a minimum free energy of -36.10 kcal/mol is given below.
 [color by base-pairing probability | color by positional entropy | no coloring]
1      UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAUACAAUCUACUGUCUUUCCUA
1      (((((.(((..((((((((((((((((...(((.....))).((....))....))))))))))))))))..))))))))

I took the 80bp sequence and randomised it 5 times and used these random sequences as input to RNAfold to compare minimum free energies:

>random1
AACGGUACUAUUCAAUUUGGCGCUUCGGUAUAACGUAAAA
UCCUAUAUGGUUCGUUUCCUCAGAGGGAAUGUUCGACGGA

Minimum free energy of -13.40 kcal/mol

>random2
UAUGCGAGGGGUUGGGUAAUGUAGCACCACAACCGUAUCU
AUCGUAAUGCUUAUAUGUAAACUUUUAGGUUCGCUGCACA

Minimum free energy of -18.00 kcal/mol

>random3
CUAUCUGCUUAUAUAUGAAAUCGGGUUCCGCCAUCCCAAA
ACAAAGCUUGACUGCUGGGGAUUAAUGUUGUUGGUAGUGA

Minimum free energy of -15.60 kcal/mol

>random4
AAACCCAAGGCAAUCGUUACAGUCGGUGUUUGAUUGAAUG
UGAAGAGAACGCUUCCCGUUACCGCUAUUUGGGUUUAAUU

Minimum free energy of -21.90 kcal/mol

>random5
AUCUGCCACGUACCUGACCUAGAAUUAUCUUGAGUUUGAG
UCUUACACUGGGGACAGGAUAUGAUUAAGCAGUUCUAGGU

Minimum free energy of -21.41 kcal/mol

Results for thermodynamic ensemble prediction

The free energy of the thermodynamic ensemble is -37.53 kcal/mol.
The frequency of the MFE structure in the ensemble is 9.86 %.
The ensemble diversity is 5.89 .

Pre-miRNAs are processed by Dicer into the mature miRNAs creating two overhangs. The corresponding mature miRNAs are also available at miRBase.

>hsa-let-7a-5p MIMAT0000062 Homo sapiens let-7a-5p
UGAGGUAGUAGGUUGUAUAGUU
>hsa-let-7a-3p MIMAT0004481 Homo sapiens let-7a-3p
CUAUACAAUCUACUGUCUUUC
>hsa-let-7a-2-3p MIMAT0010195 Homo sapiens let-7a-2-3p
CUGUACAGCCUCCUAGCUUUCC

Below I align the 5' and 3' mature let-7a-1 miRNAs back to the pre-miRNA, also including the secondary structure prediction and delimiting the miRNA boundaries:

hairpin         UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAUACAAUCUACUGUCUUUCCUA
1               (((((.(((..((((((((((((((((...(((.....))).((....))....))))))))))))))))..))))))))
mirna           -----UGAGGUAGUAGGUUGUAUAGUU-----------------------------CUAUACAAUCUACUGUCUUUC---
                     **********************                             *********************
                   |                                                                        |
                   --------------------------------------------------------------------------
                                          |                           |
                                          -----------------------------

And finally in graphical form with the 5' and 3' mature miRNAs:

Mapping long reads with Bowtie

Just a simple test to see if Bowtie can map long reads. Why? Well because Bowtie is fast, so I want to see if I can also use it as a general purpose aligner. In a previous post I was characterising the mapability of the genome. From this I selected a portion of the genome that is unique, chr1:19472012-19482647 (10,635 bp). Then I made a bed file spanning different lengths of this portion:

chr1 19472013 19472043
chr1 19472013 19472113
chr1 19472013 19472213
chr1 19472013 19472513
chr1 19472013 19473013
chr1 19472013 19474013
chr1 19472013 19477013
chr1 19472013 19482013

Then using fastaFromBed:

fastaFromBed -fi /analysisdata/genomes/hg19_male.fa -bed unique.bed -fo unique.fa

So there are 8 sequences, with lengths 30, 100, 200, 500, 1,000, 2,000, 5,000 and 10,000.

First let's try to map these using blat.

blat -t=dna -q=dna -minIdentity=100 hg19.2bit unique.fa unique.psl
Loaded 3137161264 letters in 93 sequences
Searched 18830 bases in 8 sequences
cat unique.psl
psLayout version 3

match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap strand    Q               Q       Q       Q       T               T       T       T       block   blockSizes      qStarts  tStarts
        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
100     0       0       0       0       0       0       0     + chr1:19472013-19472113  100     0       100     chr1    249250621       19472013        19472113        1       100,    0,  19472013,
200     0       0       0       0       0       0       0     + chr1:19472013-19472213  200     0       200     chr1    249250621       19472013        19472213        1       200,    0,  19472013,
500     0       0       0       0       0       0       0     + chr1:19472013-19472513  500     0       500     chr1    249250621       19472013        19472513        1       500,    0,  19472013,
1000    0       0       0       0       0       0       0     + chr1:19472013-19473013  1000    0       1000    chr1    249250621       19472013        19473013        1       1000,   0,  19472013,
2000    0       0       0       0       0       0       0     + chr1:19472013-19474013  2000    0       2000    chr1    249250621       19472013        19474013        1       2000,   0,  19472013,
5000    0       0       0       0       0       0       0     + chr1:19472013-19477013  5000    0       5000    chr1    249250621       19472013        19477013        1       5000,   0,  19472013,
10000   0       0       0       0       0       0       0     + chr1:19472013-19482013  10000   0       10000   chr1    249250621       19472013        19482013        1       10000,  0,  19472013,

Now let's see if Bowtie can map these "long reads":

bowtie2-build hg19.fa hg19
bowtie2 -x hg19 -f -U unique.fa -S unique.sam
8 reads; of these:
  8 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    7 (87.50%) aligned exactly 1 time
    1 (12.50%) aligned >1 times
100.00% overall alignment rate
cat unique.sam | grep -v "^@" | cut -f1-4
chr1:19472013-19472043  0       chr3    54652417
chr1:19472013-19472113  0       chr1    19472014
chr1:19472013-19472213  0       chr1    19472014
chr1:19472013-19472513  0       chr1    19472014
chr1:19472013-19473013  0       chr1    19472014
chr1:19472013-19474013  0       chr1    19472014
chr1:19472013-19477013  0       chr1    19472014
chr1:19472013-19482013  0       chr1    19472014

The 30mer could not be mapped by blat but was mapped by Bowtie but to another place. It looks like it may be possible to map long reads, up to 10,000 bp in length with Bowtie, at least with these trivial examples.

Things to keep in mind when using Bowtie2

Taken from the Bowtie2 manual.

For an alignment to be considered "valid" (i.e. "good enough") by Bowtie 2, it must have an alignment score no less than the minimum score threshold. The threshold is configurable and is expressed as a function of the read length. In end-to-end alignment mode (a global alignment), the default minimum score threhsold is -0.6 + -0.6 * L, where L is the read length. For example, for a 30 bp read the minimum score threshold would be -0.6 + (-0.6 * 30) = -18.6.

A mismatched base at a high-quality position in the read receives a penalty of -6 by default. A length-2 read gap receives a penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension). Thus for a 30 bp read, an alignment is still considered valid with 3 mismatches (-18) and one gap + one mismatch (-14). I'm not sure what the threshold is for a high-quality position, however when aligning without base-calling qualities scores, every mismatch receives a penalty of -6.

The alignment scores are displayed in the SAM output as the AS field.

AS:i:
Alignment score. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if SAM record is for an aligned read.

XS:i:
Alignment score for second-best alignment. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read.

Other ways to increase sensitivity is to alter the seed length of reads (-L), the interval between extracted seeds (-i), and the number of mismatches permitted per seed (-N). See also: -D, which puts an upper limit on the number of dynamic programming problems (i.e. seed extensions) that can "fail" in a row before Bowtie 2 stops searching and see also: -R, which sets the maximum number of times Bowtie 2 will "re-seed" when attempting to align a read with repetitive seeds.