Step by step analysis of defining the tRNA repertoire in the bovine genome (in the most recent version) based on the procedure published in this article.
1. Download bosTau5.fa.gz from http://hgdownload-test.cse.ucsc.edu/goldenPath/bosTau5/bigZips/ and gunzip
2. Download and install tRNAscan-SE from http://lowelab.ucsc.edu/software/tRNAscan-SE.tar.gz
I had problems installing tRNAscan-SE on Ubuntu 11.04 (64 bit) but found the solution here: http://happykhan.com/?p=545
3. Setup RepeatMasker (instructions can be followed here). I downloaded repeatmaskerlibraries-20110419.tar.gz and RepeatMasker-open-3-3-0.tar.gz.
4. ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.2.25+-x64-linux.tar.gz
5. Once tRNAscan-SE is up, perform the search on bosTau5.fa
./tRNAscan-SE -o bos_tau_5_trnascan_se.output -f bos_tau_5_trnascan_se_trna.output -m bos_tau_5_trnascan_se.log ~/genome/bosTau5.fa
tRNAscan-SE v.1.3 (March 2011) scan results (on host tan118-VirtualBox)
Started: Sat Sep 10 14:33:30 JST 2011
————————————————————
Sequence file(s) to search: /home/tan118/genome/bosTau5.fa
Search Mode: Eukaryotic
Results written to: bos_tau_5_trnascan_se.output
Output format: Tabular
Searching with: tRNAscan + EufindtRNA -> Cove
Covariance model: TRNA2-euk.cm
tRNAscan parameters: Strict
EufindtRNA parameters: Relaxed (Int Cutoff= -32.1)
tRNA secondary structure
predictions saved to: bos_tau_5_trnascan_se_trna.output
Search log saved in:
Search statistics saved in: bos_tau_5_trnascan_se.log
————————————————————
First-pass (tRNAscan/EufindtRNA) Stats:
—————
Sequences read: 11894
Seqs w/at least 1 hit: 6798
Bases read: 2918090798 (x2 for both strands)
Bases in tRNAs: 39937512
tRNAs predicted: 519405
Av. tRNA length: 76
Script CPU time: 6153.84 s
Scan CPU time: 1249.34 s
Scan speed: 4671.4 Kbp/sec
First pass search(es) ended: Sat Sep 10 16:41:51 JST 2011
Cove Stats:
———–
Candidate tRNAs read: 519405
Cove-confirmed tRNAs: 225682
Bases scanned by Cove: 48247536
% seq scanned by Cove: 0.8 %
Script CPU time: 1553.15 s
Cove CPU time: 22435.24 s
Scan speed: 2150.5 bp/sec
Cove analysis of tRNAs ended: Sun Sep 11 08:42:34 JST 2011
Summary
——–
Overall scan speed: 185915.6 bp/sec
tRNAs decoding Standard 20 AA: 31889
Selenocysteine tRNAs (TCA): 3157
Possible suppressor tRNAs (CTA,TTA): 243
tRNAs with undetermined/unknown isotypes: 1143
Predicted pseudogenes: 189250
——-
Total tRNAs: 225682
tRNAs with introns: 4279
| Ala-AGC: 6 | Ala-GGC: 33 | Ala-CGC: 2 | Ala-TGC: 29 | Gly-ACC: 122 | Gly-GCC: 1826 | Gly-CCC: 74 | Gly-TCC: 103 | Pro-GGG: 4 | Pro-CGG: 1 | Pro-TGG: 1 | Thr-AGT: 6 | Thr-GGT: 8 | Thr-CGT: 2 | Thr-TGT: 5 | Val-AAC: 3 | Val-GAC: 48 | Val-CAC: 4 | Val-TAC: 8 | Ser-AGA: 8 | Ser-GGA: 13 | Ser-CGA: 2 | Ser-TGA: 3 | Ser-ACT: 22 | Ser-GCT: 148 | Arg-ACG: 8 | Arg-GCG: 34 | Arg-CCG: 1 | Arg-TCG: 7 | Arg-CCT: 12 | Arg-TCT: 15 | Leu-AAG: 6 | Leu-GAG: 5 | Leu-CAG: 8 | Leu-TAG: 2 | Leu-CAA: 9 | Leu-TAA: 4 | Phe-AAA: 7 | Phe-GAA: 46 | Asn-ATT: 7 | Asn-GTT: 16 | Lys-CTT: 3 | Lys-TTT: 8 | Asp-ATC: 14 | Asp-GTC: 174 | Glu-CTC: 12 | Glu-TTC: 47 | His-ATG: 6 | His-GTG: 49 | Gln-CTG: 7 | Gln-TTG: 11 | Ile-AAT: 3 | Ile-GAT: 8 | Ile-TAT: 6 | Met-CAT: 2 | Tyr-ATA: 15 | Tyr-GTA: 94 | Supres-CTA: 6 | Supres-TTA: 9 | Cys-ACA: 136 | Cys-GCA: 918 | Trp-CCA: 55 | SelCys-TCA: 28 |
Isotype / Anticodon Counts:
Ala : 335 AGC: 61 GGC: 54 CGC: 41 TGC: 179
Gly : 8382 ACC: 209 GCC: 2631 CCC: 3431 TCC: 2111
Pro : 37 AGG: 14 GGG: 3 CGG: 7 TGG: 13
Thr : 55 AGT: 25 GGT: 3 CGT: 11 TGT: 16
Val : 258 AAC: 39 GAC: 55 CAC: 99 TAC: 65
Ser : 637 AGA: 41 GGA: 214 CGA: 22 TGA: 29 ACT: 39 GCT: 292
Arg : 932 ACG: 51 GCG: 234 CCG: 57 TCG: 26 CCT: 308 TCT: 256
Leu : 174 AAG: 17 GAG: 25 CAG: 19 TAG: 14 CAA: 63 TAA: 36
Phe : 414 AAA: 48 GAA: 366
Asn : 81 ATT: 13 GTT: 68
Lys : 358 CTT: 110 TTT: 248
Asp : 360 ATC: 37 GTC: 323
Glu : 2493 CTC: 618 TTC: 1875
His : 237 ATG: 36 GTG: 201
Gln : 173 CTG: 104 TTG: 69
Ile : 68 AAT: 28 GAT: 18 TAT: 22
Met : 46 CAT: 46
Tyr : 967 ATA: 179 GTA: 788
Supres: 243 CTA: 162 TTA: 81
Cys : 14133 ACA: 1930 GCA: 12203
Trp : 1749 CCA: 1749
SelCys: 3157 TCA: 3157
6. Make fasta file from tRNAscan-SE output using the script below.
#!/usr/bin/perl
use strict;
use warnings;
my $usage = "Usage: $0 <trnascan.output> <genome.fa> <output.fa>\n";
my $infile = shift or die $usage;
my $genome = shift or die $usage;
my $outfile = shift or die $usage;
my %genome = ();
my $current_acc = '';
open(IN,'<',$genome) || die "Could not open $genome: $!\n";
TEST: while(<IN>){
chomp;
#if you are working with bacterial genomes change the line below to
#if (/^>([A-Za-z0-9\|\.\_]+)\s+/){
if (/^>(.*)$/){
$current_acc = $1;
} else {
$genome{$current_acc} .= $_;
}
#last TEST unless $current_acc eq 'chr1';
}
close(IN);
warn "Stored genome into memory\n";
my %trna = ();
open (IN, '<', $infile) or die "Could not open $infile: $!\n";
open (OUT,'>', $outfile) or die "Could not open $outfile for writing: $!\n";
while (<IN>){
chomp;
#Sequence tRNA Bounds tRNA Anti Intron Bounds Cove
#Name tRNA # Begin End Type Codon Begin End Score
#-------- ------ ---- ------ ---- ----- ----- ---- ------
#chr1 1 32999 33071 Pseudo GCA 0 0 20.17
#chr1 2 34208 34280 Pseudo GCA 0 0 31.02
#chr1 3 34729 34800 Pseudo GAA 0 0 24.24
#chr1 3924 115232036 115232108 Lys TTT 0 0 79.55
next if (/^Sequence/);
next if (/^Name/);
next if (/^--/);
next if (/^$/);
my $strand = '+';
my ($acc, $trna_number, $trna_start, $trna_end, $trna_type, $anticodon, $intron_start, $intron_end, $cove_score) = split(/\s+/);
#tRNAscan likes to add a space to the coordinates randomly
$trna_start =~ s/\s+//g;
$trna_end =~ s/\s+//g;
if ($trna_start > $trna_end){
$strand = '-';
($trna_start, $trna_end) = ($trna_end, $trna_start);
}
my $length = $trna_end - $trna_start;
#substr is 0 based and coordinates in tRNAscan-SE are 1 based
my $seq = substr($genome{$acc},$trna_start-1,$length+1);
$seq = uc($seq);
if ($strand eq '-'){
$seq =~ tr/ACGT/TGCA/;
$seq = reverse($seq);
}
#my $id = "tRNA_${trna_number}_${acc}_${trna_start}_${trna_end}_${strand}_${trna_type}_${anticodon}_${cove_score}";
if ($acc =~ /chrUn/){
$acc =~ s/^chr//;
}
my $id = "${acc}_${trna_start}_${trna_end}_${strand}_${trna_type}_${anticodon}_${cove_score}";
print OUT ">$id\n$seq\n";
}
close(IN);
close(OUT);
exit(0);
7. Run RepeatMasker on the fasta file using these parameters:
./RepeatMasker -s -species cow bos_tau_5_trnascan_se_short_id.fa
8. Some stats from RepeatMasker
cat bos_tau_5_trnascan_se_short_id.fa.out | grep -P "\s+\d+" | awk '{print $5}' | sort -u | wc
225667 225667 9513704
cat bos_tau_5_trnascan_se_short_id.fa | grep ">" | wc
225682 225682 9740003
cat bos_tau_5_trnascan_se_short_id.fa.out | grep -P "\s+\d+" | awk '{print $11}' | sort | uniq -c | sort -k1rn | less
149028 SINE/BovA
74586 SINE/tRNA-Glu
2264 tRNA
126 LTR/ERV1
19 LINE/L1
13 Simple_repeat
12 Low_complexity
10 SINE/MIR
2 SINE/tRNA
2 Unknown
1 SINE/RTE-BovB
1 snRNA
Perl script called remove_repeat.pl used to remove putative tRNA entries tagged as something other than tRNA by RepeatMasker
#!/usr/bin/perl
use strict;
use warnings;
my $fa = 'bos_tau_5_trnascan_se_short_id.fa';
my $rm = 'bos_tau_5_trnascan_se_short_id.fa.out';
my %repeat = ();
open(IN,'<',$rm) || die "Could not open $rm: $!\n";
while(<IN>){
chomp;
next unless /^\s+\d+/;
my @line_split = split(/\s+/);
next if $line_split[11] eq 'tRNA';
$repeat{$line_split[5]} = '1';
}
close(IN);
open(IN,'<',$fa) || die "Could not open $fa: $!\n";
while(<IN>){
chomp;
if (/^>(.*)/){
my $defline = $1;
my $seq = <IN>;
chomp($seq);
next if (exists $repeat{$defline});
print ">$defline\n$seq\n";
} else {
die "Error in line $.: $_\n";
}
}
close(IN);
remove_repeat.pl | grep ">" | wc
2239 2239 93827
remove_repeat.pl | grep ">" | grep -v "Pseudo" | wc
1257 1257 50982
remove_repeat.pl | grep ">" | grep -v "Pseudo" | grep -v ">Un" | wc
1121 1121 45380
remove_repeat.pl | grep ">" | grep -v "Pseudo" | grep -v ">Un" | cut -f5 -d'_' | sort | uniq -c | sort -k1rn | less
253 Glu
105 Lys
75 Gly
66 Val
63 Ala
59 Leu
55 Asp
46 Ser
43 Gln
42 Arg
35 Asn
34 Thr
31 Met
31 Phe
30 Ile
29 Cys
28 Pro
24 Tyr
19 SeC
17 Undet
16 His
13 Trp
6 Sup
1 SeC(e)
remove_repeat.pl | grep ">" | grep -v "Pseudo" | grep -v ">Un" | cut -f5,6 -d'_' | sort | uniq -c | sort -k1rn | less
199 Glu_TTC
64 Lys_TTT
54 Glu_CTC
50 Asp_GTC
41 Lys_CTT
37 Ala_AGC
33 Asn_GTT
31 Gly_TCC
31 Met_CAT
29 Cys_GCA
29 Phe_GAA
28 Val_CAC
26 Gly_GCC
26 Val_AAC
23 Gln_CTG
23 Tyr_GTA
20 Gln_TTG
19 Ile_AAT
19 SeC_TCA
19 Thr_AGT
17 Gly_CCC
17 Ser_GCT
17 Undet_???
16 Ala_TGC
16 His_GTG
16 Leu_TAA
14 Leu_CAG
14 Pro_AGG
13 Arg_CCT
13 Leu_AAG
13 Ser_AGA
13 Trp_CCA
11 Leu_CAA
11 Val_TAC
10 Arg_TCG
10 Ile_TAT
10 Thr_TGT
9 Pro_TGG
8 Ala_CGC
8 Arg_ACG
7 Arg_TCT
7 Ser_TGA
5 Asp_ATC
5 Leu_TAG
5 Pro_CGG
5 Ser_CGA
5 Sup_TTA
5 Thr_CGT
4 Arg_CCG
4 Ser_ACT
2 Ala_GGC
2 Asn_ATT
2 Phe_AAA
1 Gly_ACC
1 Ile_GAT
1 SeC(e)_TCA
1 Sup_CTA
1 Tyr_ATA
1 Val_GAC
8. Comparison with results from the genomic tRNA database.
| tRNA | Tang et al., 2009 | Genomic tRNA database |
| Glu_TTC | 199 | 450 |
| Lys_TTT | 64 | 81 |
| Glu_CTC | 54 | 171 |
| Asp_GTC | 50 | 48 |
| Lys_CTT | 41 | 40 |
| Ala_AGC | 37 | 30 |
| Asn_GTT | 33 | 40 |
| Gly_TCC | 31 | 398 |
| Met_CAT | 31 | 33 |
| Cys_GCA | 29 | 353 |
| Phe_GAA | 29 | 28 |
| Val_CAC | 28 | 45 |
| Gly_GCC | 26 | 62 |
| Val_AAC | 26 | 23 |
| Gln_CTG | 23 | 42 |
| Tyr_GTA | 23 | 38 |
| Gln_TTG | 20 | 18 |
| Ile_AAT | 19 | 17 |
| SeC_TCA | 19 | 25 |
| Thr_AGT | 19 | 14 |
| Gly_CCC | 17 | 1315 |
| Ser_GCT | 17 | 18 |
| Ala_TGC | 16 | 34 |
| His_GTG | 16 | 22 |
| Leu_TAA | 16 | 9 |
| Leu_CAG | 14 | 9 |
| Pro_AGG | 14 | 12 |
| Arg_CCT | 13 | 111 |
| Leu_AAG | 13 | 11 |
| Ser_AGA | 13 | 14 |
| Trp_CCA | 13 | 182 |
| Leu_CAA | 11 | 15 |
| Val_TAC | 11 | 22 |
| Arg_TCG | 10 | 12 |
| Ile_TAT | 10 | 8 |
| Thr_TGT | 10 | 10 |
| Pro_TGG | 9 | 9 |
| Ala_CGC | 8 | 17 |
| Arg_ACG | 8 | 14 |
| Arg_TCT | 7 | 44 |
| Ser_TGA | 7 | 6 |
| Asp_ATC | 5 | 6 |
| Leu_TAG | 5 | 5 |
| Pro_CGG | 5 | 5 |
| Ser_CGA | 5 | 7 |
| Sup_TTA | 5 | 4 |
| Thr_CGT | 5 | 7 |
| Arg_CCG | 4 | 14 |
| Ser_ACT | 4 | 4 |
| Ala_GGC | 2 | 2 |
| Asn_ATT | 2 | 0 |
| Phe_AAA | 2 | 7 |
| Gly_ACC | 1 | 23 |
| Ile_GAT | 1 | 0 |
| SeC(e)_TCA | 1 | 25 |
| Sup_CTA | 1 | 12 |
| Tyr_ATA | 1 | 10 |
| Val_GAC | 1 | 0 |
Spearman’s rho of 0.748.
9. Lastly to find evolutionary conserved tRNAs, download the human, mouse, rat, fugu, chicken, horse, chimpanzee and dog genomes. Following the procedures listed above, run tRNAscan-SE on the respective genomes, produce tRNA fasta files for each respective genome using the Perl script in step 6. Remove intronic sequences detected by tRNAscan-SE using this script:
#!/usr/bin/perl
use strict;
use warnings;
my $usage = "Usage: $0 <tRNAscanOutput> <input.fa> <output.fa>\n";
my $infile = shift or die $usage;
my $fasta = shift or die $usage;
my $outfile = shift or die $usage;
my $acc = '';
my %seq = ();
open (IN,'<',$fasta) or die "Can't open $fasta: $!\n";
while (<IN>){
chomp;
#>tRNA_9_chr1_164648_164720_+_Pseudo_GCA_25.07
if (/^>(.*)$/){
$acc = $1;
}
elsif (/[AGTCagtc]+/){
$seq{$acc} .= $_;
}
else {
die "Can't parse line $.: $_\n";
}
}
close(IN);
my $counter = '0';
my $intron = '0';
my $no_intron = '0';
open (OUT, '>', $outfile) or die "Can't open $outfile for writing: $!\n";
open (IN, '<', $infile) or die "Can't open $infile\n";
while (<IN>){
chomp;
next if (/^Sequence/);
next if (/^Name/);
next if (/^--/);
next if (/^$/);
next if (/Pseudo/);
++$counter;
#Sequence tRNA Bounds tRNA Anti Intron Bounds Cove
#Name tRNA # Begin End Type Codon Begin End Score
#-------- ------ ---- ------ ---- ----- ----- ---- ------
#chr1 1 32999 33071 Pseudo GCA 0 0 20.17
my ($acc, $trnaNo, $trnaStart, $trnaEnd, $trnaType, $antiCodon, $intronStart, $intronEnd, $coveScore) = split(/\s+/);
if ($acc =~ /chrUn/){
$acc =~ s/chr//;
}
#tRNAscan likes to add a space to the coordinates randomly
$trnaStart =~ s/\s+//g;
$trnaEnd =~ s/\s+//g;
my $start = $trnaStart;
my $end = $trnaEnd;
my $strand = '+';
if ($trnaStart > $trnaEnd){
$strand = '-';
($trnaStart, $trnaEnd) = ($trnaEnd, $trnaStart);
}
if ($intronStart > $intronEnd){
($intronStart,$intronEnd) = ($intronEnd,$intronStart);
}
#my $id = "tRNA_${trnaNo}_${acc}_${trnaStart}_${trnaEnd}_${strand}_${trnaType}_${antiCodon}_${coveScore}";
my $id = "${acc}_${trnaStart}_${trnaEnd}_${strand}_${trnaType}_${antiCodon}_${coveScore}";
#Check if tRNAs are annotated correctly
unless (exists $seq{$id}){
die "tRNA $id does not exist in fasta\n";
}
my $sequence = $seq{$id};
my $length = length($sequence);
if ($intronStart != 0 && $intronEnd != 0){
++$intron;
#Offset starts at 0 for substring
#Coordinate for end of first part
my $firstPartCo = $intronStart - $trnaStart;
#Coordinate for start of second part
my $secondPartCo = $intronEnd - $trnaStart;
#Coordinate for end of second part
my $thirdPartCo = $trnaEnd - $intronEnd;
#IF intron is on negative strand need to start counting from the other end
if ($strand eq '-'){
$sequence = reverse($sequence);
#Offset starts at 0 for substring
#Coordinate for end of first part
my $firstPartCo = $intronStart - $trnaStart;
#Coordinate for start of second part
my $secondPartCo = $intronEnd - $trnaStart;
#Coordinate for end of second part
my $thirdPartCo = $trnaEnd - $intronEnd;
my $first = substr($sequence,0,$firstPartCo);
$first = reverse($first);
#0 based so add 1 to move 1 base to the right
my $second = substr($sequence,$secondPartCo+1,$thirdPartCo);
$second = reverse($second);
my $tempSeq = $second . $first;
$length = length ($tempSeq);
print OUT ">$id\n";
print OUT "$second$first\n";
}
else {
my $first = substr($sequence,0,$firstPartCo);
#0 based so add 1 to move 1 base to the right
my $second = substr($sequence,$secondPartCo+1,$thirdPartCo);
my $tempSeq = $first . $second;
$length = length ($tempSeq);
print OUT ">$id\n";
print OUT "$first$second\n";
}
}
else {
++$no_intron;
print OUT ">$id\n";
print OUT "$sequence\n";
}
}
close (IN);
close (OUT);
warn "Processed $counter tRNAs and there were $intron tRNAs with intronic sequence and $no_intron without\n";
exit(0);
Using each respective fasta file (where there are no intronic sequences), make BLAST databases and finally BLAST the initial cow tRNA fasta set to each respective tRNA database from the different genomes. We will use a sequence identity of >= 95% as a threshold for orthology.
/home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/human/hg19_trnascan_se.output genome/human/hg19.fa hg19_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/mouse/mm9_trnascan_se.output genome/mouse/mm9.fa mm9_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/rat/rn4_trnascan_se.output genome/rat/rn4.fa rn4_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/fugu/fr2_trnascan_se.output genome/fugu/fr2.fa fr2_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/horse/equ_cab_2_trnascan_se.output genome/horse/equ_cab_2.fa equ_cab_2_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/dog/canfam2_trnascan_se.output genome/dog/canfam2.fa canfam2_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/chicken/galgal3_trnascan_se.output genome/chicken/gal_gal_3.fa galgal3_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/chimpanzee/pan_tro_3_trnascan_se.output genome/chimpanzee/panTro3.fa pan_tro_3_trnascan_se.fa makeblastdb -in hg19_trnascan_se_no_intron.fa -dbtype nucl -title human_trna -out human_trna makeblastdb -in mm9_trnascan_se_no_intron.fa -dbtype nucl -title mouse_trna -out mouse_trna makeblastdb -in rn4_trnascan_se_no_intron.fa -dbtype nucl -title rat_trna -out rat_trna makeblastdb -in fr2_trnascan_se_no_intron.fa -dbtype nucl -title fugu_trna -out fugu_trna makeblastdb -in equ_cab_2_trnascan_se_no_intron.fa -dbtype nucl -title horse_trna -out horse_trna makeblastdb -in canfam2_trnascan_se_no_intron.fa -dbtype nucl -title dog_trna -out dog_trna makeblastdb -in galgal3_trnascan_se_no_intron.fa -dbtype nucl -title chicken_trna -out chicken_trna makeblastdb -in pan_tro_3_trnascan_se_no_intron.fa -dbtype nucl -title chimpanzee_trna -out chimpanzee_trna blastn -db chicken_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_chicken_trna_no_intron.blastout blastn -db mouse_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_mouse_trna_no_intron.blastout blastn -db horse_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_horse_trna_no_intron.blastout blastn -db dog_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_dog_trna_no_intron.blastout blastn -db rat_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_rat_trna_no_intron.blastout blastn -db human_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_human_trna_no_intron.blastout blastn -db fugu_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_fugu_trna_no_intron.blastout blastn -db chimpanzee_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_chimpanzee_trna_no_intron.blastout
To find conserved tRNAs and the corresponding phylogenetic profile, use this script:
#!/usr/bin/perl
use strict;
use warnings;
my $usage = "Usage: $0 <threshold i.e. 1.00 -> 0.50>\n";
my $trna_fasta = 'bos_tau_5_trnascan_se_no_intron.fa';
my $threshold = shift or die $usage;
my @files = qw/
bos_tau_5_trna_to_chicken_trna_no_intron.blastout
bos_tau_5_trna_to_dog_trna_no_intron.blastout
bos_tau_5_trna_to_fugu_trna_no_intron.blastout
bos_tau_5_trna_to_horse_trna_no_intron.blastout
bos_tau_5_trna_to_human_trna_no_intron.blastout
bos_tau_5_trna_to_mouse_trna_no_intron.blastout
bos_tau_5_trna_to_rat_trna_no_intron.blastout
bos_tau_5_trna_to_chimpanzee_trna_no_intron.blastout
/;
my %phyloProfile = ();
my %tRNAlist = ();
open (IN, '<',"$trna_fasta") or die "Can't open $trna_fasta: $!\n";
while (<IN>){
chomp;
if (/^>(.*)/){
my $trna = $1;
my $seq = <IN>;
chomp($seq);
my $length = length($seq);
$tRNAlist{$trna} = $length;
} else {
die "Could not parse line $.: $_\n";
}
}
close (IN);
foreach my $currentFile (@files){
open (IN, '<', $currentFile) or die "Can't open $currentFile: $!\n";
while (<IN>){
chomp;
next if /^#/;
my ($query, $subject, $id, $aliLen, $mismatch, $gapOpen, $qStart, $qEnd, $sStart, $sEnd, $eValue, $bitScore) = split (/\t/);
if (exists $tRNAlist{$query}){
my $percentMatch = $id * $aliLen / '100';
if ($percentMatch =~ /(\d+)\.(\d+)/){
my $first = $1;
my $second = $2;
if ($second =~ /^9/){
$percentMatch = $first + 1;
}
elsif ($second =~ /^0/){
$percentMatch = $first;
}
}
if (($percentMatch / $tRNAlist{$query}) >= $threshold){
$phyloProfile{$query}{$currentFile} = '1';
}
}
else {
die "Query $query is missing\n";
}
}
close (IN);
}
foreach my $trna (keys %tRNAlist){
print "$trna\t";
my $line = '';
foreach my $files (@files){
if (exists $phyloProfile{$trna}{$files}){
$line .= '1';
}
else {
$line .= '0';
}
}
my $letter = substr($line,0,1);
substr($line,0,1,"1$letter");
print "$line\n";
}
The output from the above script should look something like this:
tRNA_2015_chrX_56490796_56490867_+_Undet_???_33.67 100000000
tRNA_3466_chr23_30023502_30023573_-_Glu_CTC_76.61 111111111
tRNA_1640_chr24_40048717_40048789_+_Gly_GCC_22.38 100000000
To identify the list of perfectly conserved tRNA genes, the above script with a threshold of 1 (i.e. 100%).
phylogenetic_profile_thresholding.pl 1 | grep 111111111 > perfect.list
Then use this script to extract the sequences:
#!/usr/bin/perl
use strict;
use warnings;
#>BTA1.trna335-ValAAC (5010368-5010440) Val (AAC) 73 bp Sc: 47.34
my $current = '';
my %fasta = ();
my %annotation =();
open(IN,'<','bos_tau_5_trnascan_se_no_intron.fa') || die "Could not open bos_tau_5_trnascan_se_no_intron.fa: $!\n";
while(<IN>){
chomp;
next if (/^$/);
if (/^>(.*)$/){
$current = $1;
next;
} else {
$fasta{$current} .= $_;
}
}
close(IN);
open(IN,'<','perfect.list') || die "Could not open perfect.list: $!\n";
while(<IN>){
chomp;
next if (/^$/);
#BTA23.trna1281-IleAAT 111111111
my ($id,$junk) = split(/\t/);
if (exists $fasta{$id}){
print ">$id\n$fasta{$id}\n";
} else {
die "Missing id $id\n";
}
}
close(IN);
I came up with 171 tRNA genes (compared with 150 in the paper) using 100% conservation and the unique tRNA gene count was 26 (compared with 22 in the paper). Reasons for the increase are most likely due to improved genome assemblies.
And at the 0.95 threshold, there were 439 cow/bovine tRNA genes.