Using blat to map short RNAs - Dave Tang's blog

Updated on 2013 November 5th to include mapping of piRNAs

I still use blat as my multi-purpose alignment tool despite it being developed over 10 years ago. For those needing a simple introduction to blat, see my using blat post. Now I was wondering if blat could aligned short RNAs; the definition of short RNAs is generally <200nt. So here I try aligned miRNAs, which are in the size range of 20-24nt and piRNAs, which are in the size range of 26-31nt.

Mapping human miRNAs using blat

Firstly let's download a set of mature miRNA sequences from miRBase:

#release 20
wget ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz
zcat mature.fa.gz | grep ">" | wc
  30424  152120 1678688
#how many human miRNAs?
zcat mature.fa.gz | grep Homo | wc
   2578   12890  135904
#save human miRNAs
zcat mature.fa.gz | grep -A1 Homo > mature_human.fa
cat mature_human.fa | tr U T > temp
mv temp mature_human.fa

Now let's blat using the default settings to the assembled hg19 chromosomes:

#usage:
#   blat database query [-ooc=11.ooc] output.psl
blat ~/genome/hg19.fa mature_human.fa mature_human.psl
#Loaded 3095693983 letters in 25 sequences
#Searched 55887 bases in 2578 sequences
#no results

The default settings don't map any of the miRNAs. Let's use a stepSize of 1, and no thresholds on the score and identity:

#using GNU parallel to speed things up: http://www.gnu.org/software/parallel/
parallel blat -stepSize=1 -minScore=0 -minIdentity=0 ~/genome/hg19_chrom/chr{}.fa mature_human.fa mature_human_{}.psl ::: {1..22} X Y M
#how many miRNAs were mapped?
cat *.psl | awk '{print $10}' | grep hsa | sort -u | wc
#   2578    2578   38305

So we could map all the miRNAs using the maximum number of overlapping tiles and no score threshold. I've written a script that converts the psl output into a bed file. In addition the script only outputs the best scoring matches:

Let's use the script to output the best scoring matches in bed format:

cat *.psl > mature_human.psl
psl_to_bed_best_score.pl mature_human.psl mature_human_best_score.bed
#Stored 2578 entries
cat mature_human_best_score.bed | wc
  17884  214608 1392785

#how many uniquely mapped miRNAs?
cat mature_human_best_score.bed | cut -f4 | sort | uniq -c | awk '$1==1 {print}' | wc
   2147    4294   49215

#how many multimapping positions for the most highly multimapped miRNA?
cat mature_human_best_score.bed | cut -f4 | sort | uniq -c | sort -k1rn | head
    768 hsa-miR-1273f
    768 hsa-miR-5096
    768 hsa-miR-619-5p
    757 hsa-miR-1273g-3p
    722 hsa-miR-1273h-5p
    612 hsa-miR-5095
    591 hsa-miR-5585-3p
    584 hsa-miR-548ap-3p
    510 hsa-miR-548ar-5p
    507 hsa-miR-548ay-5p

Let's compare our uniquely mapped miRNAs with the genomic coordinates provided by miRBase:

wget ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3
##gff-version 3
##date 2013-10-1
#
# Chromosomal coordinates of Homo sapiens microRNAs
# microRNAs:               miRBase v20
# genome-build-id:         GRCh37.p5
# genome-build-accession:  NCBI_Assembly:GCA_000001405.6
#
# Hairpin precursor sequences have type "miRNA_primary_transcript".
# Note, these sequences do not represent the full primary transcript,
# rather a predicted stem-loop portion that includes the precursor
# miRNA. Mature sequences have type "miRNA".
#
chr1    .       miRNA_primary_transcript        17369   17436   .       -    .  ID=MI0022705;Alias=MI0022705;Name=hsa-mir-6859-1
chr1    .       miRNA   17409   17431   .       -       .       ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705
chr1    .       miRNA   17369   17391   .       -       .       ID=MIMAT0027619;Alias=MIMAT0027619;Name=hsa-miR-6859-3p;Derives_from=MI0022705
chr1    .       miRNA_primary_transcript        30366   30503   .       +    .  ID=MI0006363;Alias=MI0006363;Name=hsa-mir-1302-2
chr1    .       miRNA   30438   30458   .       +       .       ID=MIMAT0005890;Alias=MIMAT0005890;Name=hsa-miR-1302;Derives_from=MI0006363
chr1    .       miRNA_primary_transcript        567705  567793  .       -    .  ID=MI0022558;Alias=MI0022558;Name=hsa-mir-6723
chr1    .       miRNA   567762  567783  .       -       .       ID=MIMAT0025855;Alias=MIMAT0025855;Name=hsa-miR-6723-5p;Derives_from=MI0022558

And by using this script:

#!/bin/env perl

use strict;
use warnings;

my $gff = 'hsa.gff3';
my $bed = 'mature_human_best_score.bed';

chomp(my $unique = `cat mature_human_best_score.bed | cut -f4 | sort | uniq -c | awk '\$1==1 {print}'`);
#      1 hsa-miR-98-5p

my %unique = ();
my @line = split(/\n/,$unique);
foreach my $entry (@line){
   if ($entry =~ /(hsa.*)$/){
      my $mirna = $1;
      $unique{$mirna} = 1;
   } else {
      die "Could not parse $entry\n";
   }
}

my %bed = ();
open(BED,'<',$bed) || die "Could not open $bed: $!\n";
while(<BED>){
   chomp;
   my ($chr, $start, $end, $name, @etc) = split(/\t/);
   next unless exists $unique{$name};
   $start += 1;
   $bed{$name}{'CHR'} = $chr;
   $bed{$name}{'START'} = $start;
   $bed{$name}{'END'} = $end;
}
close(BED);

open(GFF,'<',$gff) || die "Could not open $gff: $!\n";
while(<GFF>){
   chomp;
   next if /^#/;
   #chr1    .       miRNA   17409   17431   .       -       .       ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705
   my ($chr, $source, $type, $start, $end, $score, $strand, $phase, $id) = split(/\t/);
   my $mirna = '';

   my @id_split = split(/;/,$id);
   foreach my $split (@id_split){
      if ($split =~ /Name=(.*)$/){
         $mirna = $1;
      }
   }
   next unless exists $bed{$mirna};

   if ($chr eq $bed{$mirna}{'CHR'} && $start == $bed{$mirna}{'START'} && $end == $bed{$mirna}{'END'}){
      print "Same\n";
   } else {
      print "Different: $mirna mapped: $bed{$mirna}{'CHR'}:$bed{$mirna}{'START'}-$bed{$mirna}{'END'} gff: $chr:$start-$end\n";
   }
}
close(GFF);

exit(0);

And here are the differently mapped miRNAs:

compare.pl | grep Diff
Different: hsa-miR-8070 mapped: chr11:11804631-11804652 gff: chr11:11804738-11804759
Different: hsa-miR-4487 mapped: chr11:47422535-47422553 gff: chr11:47422574-47422592
Different: hsa-miR-4441 mapped: chr2:240007394-240007410 gff: chr2:240007524-240007540
Different: hsa-miR-297 mapped: chr8:49860914-49860934 gff: chr4:111781780-111781800
Different: hsa-miR-3142 mapped: chr5:159901415-159901436 gff: chr5:159901457-159901478
Different: hsa-miR-450a-5p mapped: chrX:133674423-133674444 gff: chrX:133674594-133674615

So of the 2,147 uniquely mapped miRNAs using blat, 6 of them mapped differently to the coordinates provided by miRBase.

Mapping human piRNAs using blat

I will use human piRNAs generated from this post. Let's map these piRNAs in the same manner:

#using GNU parallel to speed things up: http://www.gnu.org/software/parallel/
parallel blat -stepSize=1 -minScore=0 -minIdentity=0 ~/genome/hg19_chrom/chr{}.fa human_pirna.fa human_pirna_{}.psl ::: {1..22} X Y M
#how many piRNAs were mapped?
cat human_pirna*.psl > human_pirna.psl
cat human_pirna.psl | awk '{print $10}' | grep -P "n\d+" | sort -u | wc
#  32152   32152  289368

cat human_pirna.fa | grep ">" | wc
# 32152  675404 3543949

psl_to_bed_best_score.pl human_pirna.psl human_pirna_best_score.bed

#how many uniquely mapped piRNAs?
cat human_pirna_best_score.bed | cut -f4 | sort | uniq -c | awk '$1==1 {print}' | wc
#  25960   51920  415360

#how many were mapped in total?
cat human_pirna_best_score.bed | cut -f4 | sort -u | wc
#  32152   32152  257216

#percentage of uniquely mapped?
bc -l<<<25960/32152*100
#80.74147797959691465500

Conclusions

Using a stepSize of 1 and no thresholds on the scoring allowed the mapping of miRNAs and piRNAs using blat. The majority of uniquely mapped miRNAs from miRBase 20 were mapped to the location assigned by miRBase except for 6 exceptions. So it would seem that one could use blat for mapping short RNAs.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

Mapping human miRNAs using blat

Mapping human piRNAs using blat

Conclusions

Leave a Reply Cancel reply