Updated on 2013 November 5th to include mapping of piRNAs
I still use blat as my multi-purpose alignment tool despite it being developed over 10 years ago. For those needing a simple introduction to blat, see my using blat post. Now I was wondering if blat could aligned short RNAs; the definition of short RNAs is generally <200nt. So here I try aligned miRNAs, which are in the size range of 20-24nt and piRNAs, which are in the size range of 26-31nt.
Mapping human miRNAs using blat
Firstly let's download a set of mature miRNA sequences from miRBase:
#release 20 wget ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz zcat mature.fa.gz | grep ">" | wc 30424 152120 1678688 #how many human miRNAs? zcat mature.fa.gz | grep Homo | wc 2578 12890 135904 #save human miRNAs zcat mature.fa.gz | grep -A1 Homo > mature_human.fa cat mature_human.fa | tr U T > temp mv temp mature_human.fa
Now let's blat using the default settings to the assembled hg19 chromosomes:
#usage: # blat database query [-ooc=11.ooc] output.psl blat ~/genome/hg19.fa mature_human.fa mature_human.psl #Loaded 3095693983 letters in 25 sequences #Searched 55887 bases in 2578 sequences #no results
The default settings don't map any of the miRNAs. Let's use a stepSize of 1, and no thresholds on the score and identity:
#using GNU parallel to speed things up: http://www.gnu.org/software/parallel/ parallel blat -stepSize=1 -minScore=0 -minIdentity=0 ~/genome/hg19_chrom/chr{}.fa mature_human.fa mature_human_{}.psl ::: {1..22} X Y M #how many miRNAs were mapped? cat *.psl | awk '{print $10}' | grep hsa | sort -u | wc # 2578 2578 38305
So we could map all the miRNAs using the maximum number of overlapping tiles and no score threshold. I've written a script that converts the psl output into a bed file. In addition the script only outputs the best scoring matches:
Let's use the script to output the best scoring matches in bed format:
cat *.psl > mature_human.psl psl_to_bed_best_score.pl mature_human.psl mature_human_best_score.bed #Stored 2578 entries cat mature_human_best_score.bed | wc 17884 214608 1392785 #how many uniquely mapped miRNAs? cat mature_human_best_score.bed | cut -f4 | sort | uniq -c | awk '$1==1 {print}' | wc 2147 4294 49215 #how many multimapping positions for the most highly multimapped miRNA? cat mature_human_best_score.bed | cut -f4 | sort | uniq -c | sort -k1rn | head 768 hsa-miR-1273f 768 hsa-miR-5096 768 hsa-miR-619-5p 757 hsa-miR-1273g-3p 722 hsa-miR-1273h-5p 612 hsa-miR-5095 591 hsa-miR-5585-3p 584 hsa-miR-548ap-3p 510 hsa-miR-548ar-5p 507 hsa-miR-548ay-5p
Let's compare our uniquely mapped miRNAs with the genomic coordinates provided by miRBase:
wget ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3 ##gff-version 3 ##date 2013-10-1 # # Chromosomal coordinates of Homo sapiens microRNAs # microRNAs: miRBase v20 # genome-build-id: GRCh37.p5 # genome-build-accession: NCBI_Assembly:GCA_000001405.6 # # Hairpin precursor sequences have type "miRNA_primary_transcript". # Note, these sequences do not represent the full primary transcript, # rather a predicted stem-loop portion that includes the precursor # miRNA. Mature sequences have type "miRNA". # chr1 . miRNA_primary_transcript 17369 17436 . - . ID=MI0022705;Alias=MI0022705;Name=hsa-mir-6859-1 chr1 . miRNA 17409 17431 . - . ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705 chr1 . miRNA 17369 17391 . - . ID=MIMAT0027619;Alias=MIMAT0027619;Name=hsa-miR-6859-3p;Derives_from=MI0022705 chr1 . miRNA_primary_transcript 30366 30503 . + . ID=MI0006363;Alias=MI0006363;Name=hsa-mir-1302-2 chr1 . miRNA 30438 30458 . + . ID=MIMAT0005890;Alias=MIMAT0005890;Name=hsa-miR-1302;Derives_from=MI0006363 chr1 . miRNA_primary_transcript 567705 567793 . - . ID=MI0022558;Alias=MI0022558;Name=hsa-mir-6723 chr1 . miRNA 567762 567783 . - . ID=MIMAT0025855;Alias=MIMAT0025855;Name=hsa-miR-6723-5p;Derives_from=MI0022558
And by using this script:
#!/bin/env perl use strict; use warnings; my $gff = 'hsa.gff3'; my $bed = 'mature_human_best_score.bed'; chomp(my $unique = `cat mature_human_best_score.bed | cut -f4 | sort | uniq -c | awk '\$1==1 {print}'`); # 1 hsa-miR-98-5p my %unique = (); my @line = split(/\n/,$unique); foreach my $entry (@line){ if ($entry =~ /(hsa.*)$/){ my $mirna = $1; $unique{$mirna} = 1; } else { die "Could not parse $entry\n"; } } my %bed = (); open(BED,'<',$bed) || die "Could not open $bed: $!\n"; while(<BED>){ chomp; my ($chr, $start, $end, $name, @etc) = split(/\t/); next unless exists $unique{$name}; $start += 1; $bed{$name}{'CHR'} = $chr; $bed{$name}{'START'} = $start; $bed{$name}{'END'} = $end; } close(BED); open(GFF,'<',$gff) || die "Could not open $gff: $!\n"; while(<GFF>){ chomp; next if /^#/; #chr1 . miRNA 17409 17431 . - . ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705 my ($chr, $source, $type, $start, $end, $score, $strand, $phase, $id) = split(/\t/); my $mirna = ''; my @id_split = split(/;/,$id); foreach my $split (@id_split){ if ($split =~ /Name=(.*)$/){ $mirna = $1; } } next unless exists $bed{$mirna}; if ($chr eq $bed{$mirna}{'CHR'} && $start == $bed{$mirna}{'START'} && $end == $bed{$mirna}{'END'}){ print "Same\n"; } else { print "Different: $mirna mapped: $bed{$mirna}{'CHR'}:$bed{$mirna}{'START'}-$bed{$mirna}{'END'} gff: $chr:$start-$end\n"; } } close(GFF); exit(0);
And here are the differently mapped miRNAs:
compare.pl | grep Diff Different: hsa-miR-8070 mapped: chr11:11804631-11804652 gff: chr11:11804738-11804759 Different: hsa-miR-4487 mapped: chr11:47422535-47422553 gff: chr11:47422574-47422592 Different: hsa-miR-4441 mapped: chr2:240007394-240007410 gff: chr2:240007524-240007540 Different: hsa-miR-297 mapped: chr8:49860914-49860934 gff: chr4:111781780-111781800 Different: hsa-miR-3142 mapped: chr5:159901415-159901436 gff: chr5:159901457-159901478 Different: hsa-miR-450a-5p mapped: chrX:133674423-133674444 gff: chrX:133674594-133674615
So of the 2,147 uniquely mapped miRNAs using blat, 6 of them mapped differently to the coordinates provided by miRBase.
Mapping human piRNAs using blat
I will use human piRNAs generated from this post. Let's map these piRNAs in the same manner:
#using GNU parallel to speed things up: http://www.gnu.org/software/parallel/ parallel blat -stepSize=1 -minScore=0 -minIdentity=0 ~/genome/hg19_chrom/chr{}.fa human_pirna.fa human_pirna_{}.psl ::: {1..22} X Y M #how many piRNAs were mapped? cat human_pirna*.psl > human_pirna.psl cat human_pirna.psl | awk '{print $10}' | grep -P "n\d+" | sort -u | wc # 32152 32152 289368 cat human_pirna.fa | grep ">" | wc # 32152 675404 3543949 psl_to_bed_best_score.pl human_pirna.psl human_pirna_best_score.bed #how many uniquely mapped piRNAs? cat human_pirna_best_score.bed | cut -f4 | sort | uniq -c | awk '$1==1 {print}' | wc # 25960 51920 415360 #how many were mapped in total? cat human_pirna_best_score.bed | cut -f4 | sort -u | wc # 32152 32152 257216 #percentage of uniquely mapped? bc -l<<<25960/32152*100 #80.74147797959691465500
Conclusions
Using a stepSize of 1 and no thresholds on the scoring allowed the mapping of miRNAs and piRNAs using blat. The majority of uniquely mapped miRNAs from miRBase 20 were mapped to the location assigned by miRBase except for 6 exceptions. So it would seem that one could use blat for mapping short RNAs.
This work is licensed under a Creative Commons
Attribution 4.0 International License.