Step by step analysis of identifying the bovine tRNA repertoire (using the most recent version of the bovine genome) based on the procedure published in this article. This was my first first-authored paper (published as part of the bovine genome companion papers), and in retrospect there could have been many improvements, however the paper was done on what I knew back then.
1. Download bosTau5.fa.gz from http://hgdownload-test.cse.ucsc.edu/goldenPath/bosTau5/bigZips/ and gunzip
2. Download and install tRNAscan-SE from http://lowelab.ucsc.edu/software/tRNAscan-SE.tar.gz
I had problems installing tRNAscan-SE on Ubuntu 11.04 (64 bit) but found the solution here: http://happykhan.com/?p=545
3. Setup RepeatMasker (instructions can be followed here). I downloaded repeatmaskerlibraries-20110419.tar.gz and RepeatMasker-open-3-3-0.tar.gz.
4. ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.2.25+-x64-linux.tar.gz
5. Once tRNAscan-SE is up, perform the search on bosTau5.fa
./tRNAscan-SE -o bos_tau_5_trnascan_se.output -f bos_tau_5_trnascan_se_trna.output -m bos_tau_5_trnascan_se.log ~/genome/bosTau5.fa
tRNAscan-SE v.1.3 (March 2011) scan results (on host tan118-VirtualBox)
Started: Sat Sep 10 14:33:30 JST 2011
------------------------------------------------------------
Sequence file(s) to search: /home/tan118/genome/bosTau5.fa
Search Mode: Eukaryotic
Results written to: bos_tau_5_trnascan_se.output
Output format: Tabular
Searching with: tRNAscan + EufindtRNA -> Cove
Covariance model: TRNA2-euk.cm
tRNAscan parameters: Strict
EufindtRNA parameters: Relaxed (Int Cutoff= -32.1)
tRNA secondary structure
predictions saved to: bos_tau_5_trnascan_se_trna.output
Search log saved in:
Search statistics saved in: bos_tau_5_trnascan_se.log
------------------------------------------------------------
First-pass (tRNAscan/EufindtRNA) Stats:
---------------
Sequences read: 11894
Seqs w/at least 1 hit: 6798
Bases read: 2918090798 (x2 for both strands)
Bases in tRNAs: 39937512
tRNAs predicted: 519405
Av. tRNA length: 76
Script CPU time: 6153.84 s
Scan CPU time: 1249.34 s
Scan speed: 4671.4 Kbp/sec
First pass search(es) ended: Sat Sep 10 16:41:51 JST 2011
Cove Stats:
-----------
Candidate tRNAs read: 519405
Cove-confirmed tRNAs: 225682
Bases scanned by Cove: 48247536
% seq scanned by Cove: 0.8 %
Script CPU time: 1553.15 s
Cove CPU time: 22435.24 s
Scan speed: 2150.5 bp/sec
Cove analysis of tRNAs ended: Sun Sep 11 08:42:34 JST 2011
Summary
--------
Overall scan speed: 185915.6 bp/sec
tRNAs decoding Standard 20 AA: 31889
Selenocysteine tRNAs (TCA): 3157
Possible suppressor tRNAs (CTA,TTA): 243
tRNAs with undetermined/unknown isotypes: 1143
Predicted pseudogenes: 189250
-------
Total tRNAs: 225682
tRNAs with introns: 4279
| Ala-AGC: 6 | Ala-GGC: 33 | Ala-CGC: 2 | Ala-TGC: 29 | Gly-ACC: 122 | Gly-GCC: 1826 | Gly-CCC: 74 | Gly-TCC: 103 | Pro-GGG: 4 | Pro-CGG: 1 | Pro-TGG: 1 | Thr-AGT: 6 | Thr-GGT: 8 | Thr-CGT: 2 | Thr-TGT: 5 | Val-AAC: 3 | Val-GAC: 48 | Val-CAC: 4 | Val-TAC: 8 | Ser-AGA: 8 | Ser-GGA: 13 | Ser-CGA: 2 | Ser-TGA: 3 | Ser-ACT: 22 | Ser-GCT: 148 | Arg-ACG: 8 | Arg-GCG: 34 | Arg-CCG: 1 | Arg-TCG: 7 | Arg-CCT: 12 | Arg-TCT: 15 | Leu-AAG: 6 | Leu-GAG: 5 | Leu-CAG: 8 | Leu-TAG: 2 | Leu-CAA: 9 | Leu-TAA: 4 | Phe-AAA: 7 | Phe-GAA: 46 | Asn-ATT: 7 | Asn-GTT: 16 | Lys-CTT: 3 | Lys-TTT: 8 | Asp-ATC: 14 | Asp-GTC: 174 | Glu-CTC: 12 | Glu-TTC: 47 | His-ATG: 6 | His-GTG: 49 | Gln-CTG: 7 | Gln-TTG: 11 | Ile-AAT: 3 | Ile-GAT: 8 | Ile-TAT: 6 | Met-CAT: 2 | Tyr-ATA: 15 | Tyr-GTA: 94 | Supres-CTA: 6 | Supres-TTA: 9 | Cys-ACA: 136 | Cys-GCA: 918 | Trp-CCA: 55 | SelCys-TCA: 28 |
Isotype / Anticodon Counts:
Ala : 335 AGC: 61 GGC: 54 CGC: 41 TGC: 179
Gly : 8382 ACC: 209 GCC: 2631 CCC: 3431 TCC: 2111
Pro : 37 AGG: 14 GGG: 3 CGG: 7 TGG: 13
Thr : 55 AGT: 25 GGT: 3 CGT: 11 TGT: 16
Val : 258 AAC: 39 GAC: 55 CAC: 99 TAC: 65
Ser : 637 AGA: 41 GGA: 214 CGA: 22 TGA: 29 ACT: 39 GCT: 292
Arg : 932 ACG: 51 GCG: 234 CCG: 57 TCG: 26 CCT: 308 TCT: 256
Leu : 174 AAG: 17 GAG: 25 CAG: 19 TAG: 14 CAA: 63 TAA: 36
Phe : 414 AAA: 48 GAA: 366
Asn : 81 ATT: 13 GTT: 68
Lys : 358 CTT: 110 TTT: 248
Asp : 360 ATC: 37 GTC: 323
Glu : 2493 CTC: 618 TTC: 1875
His : 237 ATG: 36 GTG: 201
Gln : 173 CTG: 104 TTG: 69
Ile : 68 AAT: 28 GAT: 18 TAT: 22
Met : 46 CAT: 46
Tyr : 967 ATA: 179 GTA: 788
Supres: 243 CTA: 162 TTA: 81
Cys : 14133 ACA: 1930 GCA: 12203
Trp : 1749 CCA: 1749
SelCys: 3157 TCA: 3157
6. Make fasta file from tRNAscan-SE output using the script below.
#!/usr/bin/perl use strict; use warnings; my $usage = "Usage: $0 <trnascan.output> <genome.fa> <output.fa>\n"; my $infile = shift or die $usage; my $genome = shift or die $usage; my $outfile = shift or die $usage; my %genome = (); my $current_acc = ''; open(IN,'<',$genome) || die "Could not open $genome: $!\n"; TEST: while(<IN>){ chomp; #if you are working with bacterial genomes change the line below to #if (/^>([A-Za-z0-9\|\.\_]+)\s+/){ if (/^>(.*)$/){ $current_acc = $1; } else { $genome{$current_acc} .= $_; } #last TEST unless $current_acc eq 'chr1'; } close(IN); warn "Stored genome into memory\n"; my %trna = (); open (IN, '<', $infile) or die "Could not open $infile: $!\n"; open (OUT,'>', $outfile) or die "Could not open $outfile for writing: $!\n"; while (<IN>){ chomp; #Sequence tRNA Bounds tRNA Anti Intron Bounds Cove #Name tRNA # Begin End Type Codon Begin End Score #-------- ------ ---- ------ ---- ----- ----- ---- ------ #chr1 1 32999 33071 Pseudo GCA 0 0 20.17 #chr1 2 34208 34280 Pseudo GCA 0 0 31.02 #chr1 3 34729 34800 Pseudo GAA 0 0 24.24 #chr1 3924 115232036 115232108 Lys TTT 0 0 79.55 next if (/^Sequence/); next if (/^Name/); next if (/^--/); next if (/^$/); my $strand = '+'; my ($acc, $trna_number, $trna_start, $trna_end, $trna_type, $anticodon, $intron_start, $intron_end, $cove_score) = split(/\s+/); #tRNAscan likes to add a space to the coordinates randomly $trna_start =~ s/\s+//g; $trna_end =~ s/\s+//g; if ($trna_start > $trna_end){ $strand = '-'; ($trna_start, $trna_end) = ($trna_end, $trna_start); } my $length = $trna_end - $trna_start; #substr is 0 based and coordinates in tRNAscan-SE are 1 based my $seq = substr($genome{$acc},$trna_start-1,$length+1); $seq = uc($seq); if ($strand eq '-'){ $seq =~ tr/ACGT/TGCA/; $seq = reverse($seq); } #my $id = "tRNA_${trna_number}_${acc}_${trna_start}_${trna_end}_${strand}_${trna_type}_${anticodon}_${cove_score}"; if ($acc =~ /chrUn/){ $acc =~ s/^chr//; } my $id = "${acc}_${trna_start}_${trna_end}_${strand}_${trna_type}_${anticodon}_${cove_score}"; print OUT ">$id\n$seq\n"; } close(IN); close(OUT); exit(0);
7. Run RepeatMasker on the fasta file using these parameters:
./RepeatMasker -s -species cow bos_tau_5_trnascan_se_short_id.fa
8. Some stats from RepeatMasker
cat bos_tau_5_trnascan_se_short_id.fa.out | grep -P "\s+\d+" | awk '{print $5}' | sort -u | wc 225667 225667 9513704 cat bos_tau_5_trnascan_se_short_id.fa | grep ">" | wc 225682 225682 9740003 cat bos_tau_5_trnascan_se_short_id.fa.out | grep -P "\s+\d+" | awk '{print $11}' | sort | uniq -c | sort -k1rn | less 149028 SINE/BovA 74586 SINE/tRNA-Glu 2264 tRNA 126 LTR/ERV1 19 LINE/L1 13 Simple_repeat 12 Low_complexity 10 SINE/MIR 2 SINE/tRNA 2 Unknown 1 SINE/RTE-BovB 1 snRNA
Perl script called remove_repeat.pl used to remove putative tRNA entries tagged as something other than tRNA by RepeatMasker
#!/usr/bin/perl use strict; use warnings; my $fa = 'bos_tau_5_trnascan_se_short_id.fa'; my $rm = 'bos_tau_5_trnascan_se_short_id.fa.out'; my %repeat = (); open(IN,'<',$rm) || die "Could not open $rm: $!\n"; while(<IN>){ chomp; next unless /^\s+\d+/; my @line_split = split(/\s+/); next if $line_split[11] eq 'tRNA'; $repeat{$line_split[5]} = '1'; } close(IN); open(IN,'<',$fa) || die "Could not open $fa: $!\n"; while(<IN>){ chomp; if (/^>(.*)/){ my $defline = $1; my $seq = <IN>; chomp($seq); next if (exists $repeat{$defline}); print ">$defline\n$seq\n"; } else { die "Error in line $.: $_\n"; } } close(IN);
remove_repeat.pl | grep ">" | wc 2239 2239 93827 remove_repeat.pl | grep ">" | grep -v "Pseudo" | wc 1257 1257 50982 remove_repeat.pl | grep ">" | grep -v "Pseudo" | grep -v ">Un" | wc 1121 1121 45380 remove_repeat.pl | grep ">" | grep -v "Pseudo" | grep -v ">Un" | cut -f5 -d'_' | sort | uniq -c | sort -k1rn | less 253 Glu 105 Lys 75 Gly 66 Val 63 Ala 59 Leu 55 Asp 46 Ser 43 Gln 42 Arg 35 Asn 34 Thr 31 Met 31 Phe 30 Ile 29 Cys 28 Pro 24 Tyr 19 SeC 17 Undet 16 His 13 Trp 6 Sup 1 SeC(e) remove_repeat.pl | grep ">" | grep -v "Pseudo" | grep -v ">Un" | cut -f5,6 -d'_' | sort | uniq -c | sort -k1rn | less 199 Glu_TTC 64 Lys_TTT 54 Glu_CTC 50 Asp_GTC 41 Lys_CTT 37 Ala_AGC 33 Asn_GTT 31 Gly_TCC 31 Met_CAT 29 Cys_GCA 29 Phe_GAA 28 Val_CAC 26 Gly_GCC 26 Val_AAC 23 Gln_CTG 23 Tyr_GTA 20 Gln_TTG 19 Ile_AAT 19 SeC_TCA 19 Thr_AGT 17 Gly_CCC 17 Ser_GCT 17 Undet_??? 16 Ala_TGC 16 His_GTG 16 Leu_TAA 14 Leu_CAG 14 Pro_AGG 13 Arg_CCT 13 Leu_AAG 13 Ser_AGA 13 Trp_CCA 11 Leu_CAA 11 Val_TAC 10 Arg_TCG 10 Ile_TAT 10 Thr_TGT 9 Pro_TGG 8 Ala_CGC 8 Arg_ACG 7 Arg_TCT 7 Ser_TGA 5 Asp_ATC 5 Leu_TAG 5 Pro_CGG 5 Ser_CGA 5 Sup_TTA 5 Thr_CGT 4 Arg_CCG 4 Ser_ACT 2 Ala_GGC 2 Asn_ATT 2 Phe_AAA 1 Gly_ACC 1 Ile_GAT 1 SeC(e)_TCA 1 Sup_CTA 1 Tyr_ATA 1 Val_GAC
8. Comparison with results from the genomic tRNA database.
tRNA | Tang et al., 2009 | Genomic tRNA database |
Glu_TTC | 199 | 450 |
Lys_TTT | 64 | 81 |
Glu_CTC | 54 | 171 |
Asp_GTC | 50 | 48 |
Lys_CTT | 41 | 40 |
Ala_AGC | 37 | 30 |
Asn_GTT | 33 | 40 |
Gly_TCC | 31 | 398 |
Met_CAT | 31 | 33 |
Cys_GCA | 29 | 353 |
Phe_GAA | 29 | 28 |
Val_CAC | 28 | 45 |
Gly_GCC | 26 | 62 |
Val_AAC | 26 | 23 |
Gln_CTG | 23 | 42 |
Tyr_GTA | 23 | 38 |
Gln_TTG | 20 | 18 |
Ile_AAT | 19 | 17 |
SeC_TCA | 19 | 25 |
Thr_AGT | 19 | 14 |
Gly_CCC | 17 | 1315 |
Ser_GCT | 17 | 18 |
Ala_TGC | 16 | 34 |
His_GTG | 16 | 22 |
Leu_TAA | 16 | 9 |
Leu_CAG | 14 | 9 |
Pro_AGG | 14 | 12 |
Arg_CCT | 13 | 111 |
Leu_AAG | 13 | 11 |
Ser_AGA | 13 | 14 |
Trp_CCA | 13 | 182 |
Leu_CAA | 11 | 15 |
Val_TAC | 11 | 22 |
Arg_TCG | 10 | 12 |
Ile_TAT | 10 | 8 |
Thr_TGT | 10 | 10 |
Pro_TGG | 9 | 9 |
Ala_CGC | 8 | 17 |
Arg_ACG | 8 | 14 |
Arg_TCT | 7 | 44 |
Ser_TGA | 7 | 6 |
Asp_ATC | 5 | 6 |
Leu_TAG | 5 | 5 |
Pro_CGG | 5 | 5 |
Ser_CGA | 5 | 7 |
Sup_TTA | 5 | 4 |
Thr_CGT | 5 | 7 |
Arg_CCG | 4 | 14 |
Ser_ACT | 4 | 4 |
Ala_GGC | 2 | 2 |
Asn_ATT | 2 | 0 |
Phe_AAA | 2 | 7 |
Gly_ACC | 1 | 23 |
Ile_GAT | 1 | 0 |
SeC(e)_TCA | 1 | 25 |
Sup_CTA | 1 | 12 |
Tyr_ATA | 1 | 10 |
Val_GAC | 1 | 0 |
Spearman's rho of 0.748.
9. Lastly to find evolutionary conserved tRNAs, download the human, mouse, rat, fugu, chicken, horse, chimpanzee and dog genomes. Following the procedures listed above, run tRNAscan-SE on the respective genomes, produce tRNA fasta files for each respective genome using the Perl script in step 6. Remove intronic sequences detected by tRNAscan-SE using this script:
#!/usr/bin/perl use strict; use warnings; my $usage = "Usage: $0 <tRNAscanOutput> <input.fa> <output.fa>\n"; my $infile = shift or die $usage; my $fasta = shift or die $usage; my $outfile = shift or die $usage; my $acc = ''; my %seq = (); open (IN,'<',$fasta) or die "Can't open $fasta: $!\n"; while (<IN>){ chomp; #>tRNA_9_chr1_164648_164720_+_Pseudo_GCA_25.07 if (/^>(.*)$/){ $acc = $1; } elsif (/[AGTCagtc]+/){ $seq{$acc} .= $_; } else { die "Can't parse line $.: $_\n"; } } close(IN); my $counter = '0'; my $intron = '0'; my $no_intron = '0'; open (OUT, '>', $outfile) or die "Can't open $outfile for writing: $!\n"; open (IN, '<', $infile) or die "Can't open $infile\n"; while (<IN>){ chomp; next if (/^Sequence/); next if (/^Name/); next if (/^--/); next if (/^$/); next if (/Pseudo/); ++$counter; #Sequence tRNA Bounds tRNA Anti Intron Bounds Cove #Name tRNA # Begin End Type Codon Begin End Score #-------- ------ ---- ------ ---- ----- ----- ---- ------ #chr1 1 32999 33071 Pseudo GCA 0 0 20.17 my ($acc, $trnaNo, $trnaStart, $trnaEnd, $trnaType, $antiCodon, $intronStart, $intronEnd, $coveScore) = split(/\s+/); if ($acc =~ /chrUn/){ $acc =~ s/chr//; } #tRNAscan likes to add a space to the coordinates randomly $trnaStart =~ s/\s+//g; $trnaEnd =~ s/\s+//g; my $start = $trnaStart; my $end = $trnaEnd; my $strand = '+'; if ($trnaStart > $trnaEnd){ $strand = '-'; ($trnaStart, $trnaEnd) = ($trnaEnd, $trnaStart); } if ($intronStart > $intronEnd){ ($intronStart,$intronEnd) = ($intronEnd,$intronStart); } #my $id = "tRNA_${trnaNo}_${acc}_${trnaStart}_${trnaEnd}_${strand}_${trnaType}_${antiCodon}_${coveScore}"; my $id = "${acc}_${trnaStart}_${trnaEnd}_${strand}_${trnaType}_${antiCodon}_${coveScore}"; #Check if tRNAs are annotated correctly unless (exists $seq{$id}){ die "tRNA $id does not exist in fasta\n"; } my $sequence = $seq{$id}; my $length = length($sequence); if ($intronStart != 0 && $intronEnd != 0){ ++$intron; #Offset starts at 0 for substring #Coordinate for end of first part my $firstPartCo = $intronStart - $trnaStart; #Coordinate for start of second part my $secondPartCo = $intronEnd - $trnaStart; #Coordinate for end of second part my $thirdPartCo = $trnaEnd - $intronEnd; #IF intron is on negative strand need to start counting from the other end if ($strand eq '-'){ $sequence = reverse($sequence); #Offset starts at 0 for substring #Coordinate for end of first part my $firstPartCo = $intronStart - $trnaStart; #Coordinate for start of second part my $secondPartCo = $intronEnd - $trnaStart; #Coordinate for end of second part my $thirdPartCo = $trnaEnd - $intronEnd; my $first = substr($sequence,0,$firstPartCo); $first = reverse($first); #0 based so add 1 to move 1 base to the right my $second = substr($sequence,$secondPartCo+1,$thirdPartCo); $second = reverse($second); my $tempSeq = $second . $first; $length = length ($tempSeq); print OUT ">$id\n"; print OUT "$second$first\n"; } else { my $first = substr($sequence,0,$firstPartCo); #0 based so add 1 to move 1 base to the right my $second = substr($sequence,$secondPartCo+1,$thirdPartCo); my $tempSeq = $first . $second; $length = length ($tempSeq); print OUT ">$id\n"; print OUT "$first$second\n"; } } else { ++$no_intron; print OUT ">$id\n"; print OUT "$sequence\n"; } } close (IN); close (OUT); warn "Processed $counter tRNAs and there were $intron tRNAs with intronic sequence and $no_intron without\n"; exit(0);
Using each respective fasta file (where there are no intronic sequences), make BLAST databases and finally BLAST the initial cow tRNA fasta set to each respective tRNA database from the different genomes. We will use a sequence identity of >= 95% as a threshold for orthology.
/home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/human/hg19_trnascan_se.output genome/human/hg19.fa hg19_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/mouse/mm9_trnascan_se.output genome/mouse/mm9.fa mm9_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/rat/rn4_trnascan_se.output genome/rat/rn4.fa rn4_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/fugu/fr2_trnascan_se.output genome/fugu/fr2.fa fr2_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/horse/equ_cab_2_trnascan_se.output genome/horse/equ_cab_2.fa equ_cab_2_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/dog/canfam2_trnascan_se.output genome/dog/canfam2.fa canfam2_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/chicken/galgal3_trnascan_se.output genome/chicken/gal_gal_3.fa galgal3_trnascan_se.fa /home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/chimpanzee/pan_tro_3_trnascan_se.output genome/chimpanzee/panTro3.fa pan_tro_3_trnascan_se.fa makeblastdb -in hg19_trnascan_se_no_intron.fa -dbtype nucl -title human_trna -out human_trna makeblastdb -in mm9_trnascan_se_no_intron.fa -dbtype nucl -title mouse_trna -out mouse_trna makeblastdb -in rn4_trnascan_se_no_intron.fa -dbtype nucl -title rat_trna -out rat_trna makeblastdb -in fr2_trnascan_se_no_intron.fa -dbtype nucl -title fugu_trna -out fugu_trna makeblastdb -in equ_cab_2_trnascan_se_no_intron.fa -dbtype nucl -title horse_trna -out horse_trna makeblastdb -in canfam2_trnascan_se_no_intron.fa -dbtype nucl -title dog_trna -out dog_trna makeblastdb -in galgal3_trnascan_se_no_intron.fa -dbtype nucl -title chicken_trna -out chicken_trna makeblastdb -in pan_tro_3_trnascan_se_no_intron.fa -dbtype nucl -title chimpanzee_trna -out chimpanzee_trna blastn -db chicken_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_chicken_trna_no_intron.blastout blastn -db mouse_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_mouse_trna_no_intron.blastout blastn -db horse_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_horse_trna_no_intron.blastout blastn -db dog_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_dog_trna_no_intron.blastout blastn -db rat_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_rat_trna_no_intron.blastout blastn -db human_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_human_trna_no_intron.blastout blastn -db fugu_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_fugu_trna_no_intron.blastout blastn -db chimpanzee_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_chimpanzee_trna_no_intron.blastout
To find conserved tRNAs and the corresponding phylogenetic profile, use this script:
#!/usr/bin/perl use strict; use warnings; my $usage = "Usage: $0 <threshold i.e. 1.00 -> 0.50>\n"; my $trna_fasta = 'bos_tau_5_trnascan_se_no_intron.fa'; my $threshold = shift or die $usage; my @files = qw/ bos_tau_5_trna_to_chicken_trna_no_intron.blastout bos_tau_5_trna_to_dog_trna_no_intron.blastout bos_tau_5_trna_to_fugu_trna_no_intron.blastout bos_tau_5_trna_to_horse_trna_no_intron.blastout bos_tau_5_trna_to_human_trna_no_intron.blastout bos_tau_5_trna_to_mouse_trna_no_intron.blastout bos_tau_5_trna_to_rat_trna_no_intron.blastout bos_tau_5_trna_to_chimpanzee_trna_no_intron.blastout /; my %phyloProfile = (); my %tRNAlist = (); open (IN, '<',"$trna_fasta") or die "Can't open $trna_fasta: $!\n"; while (<IN>){ chomp; if (/^>(.*)/){ my $trna = $1; my $seq = <IN>; chomp($seq); my $length = length($seq); $tRNAlist{$trna} = $length; } else { die "Could not parse line $.: $_\n"; } } close (IN); foreach my $currentFile (@files){ open (IN, '<', $currentFile) or die "Can't open $currentFile: $!\n"; while (<IN>){ chomp; next if /^#/; my ($query, $subject, $id, $aliLen, $mismatch, $gapOpen, $qStart, $qEnd, $sStart, $sEnd, $eValue, $bitScore) = split (/\t/); if (exists $tRNAlist{$query}){ my $percentMatch = $id * $aliLen / '100'; if ($percentMatch =~ /(\d+)\.(\d+)/){ my $first = $1; my $second = $2; if ($second =~ /^9/){ $percentMatch = $first + 1; } elsif ($second =~ /^0/){ $percentMatch = $first; } } if (($percentMatch / $tRNAlist{$query}) >= $threshold){ $phyloProfile{$query}{$currentFile} = '1'; } } else { die "Query $query is missing\n"; } } close (IN); } foreach my $trna (keys %tRNAlist){ print "$trna\t"; my $line = ''; foreach my $files (@files){ if (exists $phyloProfile{$trna}{$files}){ $line .= '1'; } else { $line .= '0'; } } my $letter = substr($line,0,1); substr($line,0,1,"1$letter"); print "$line\n"; }
The output from the above script should look something like this:
tRNA_2015_chrX_56490796_56490867_+_Undet_???_33.67 100000000
tRNA_3466_chr23_30023502_30023573_-_Glu_CTC_76.61 111111111
tRNA_1640_chr24_40048717_40048789_+_Gly_GCC_22.38 100000000
To identify the list of perfectly conserved tRNA genes, the above script with a threshold of 1 (i.e. 100%).
phylogenetic_profile_thresholding.pl 1 | grep 111111111 > perfect.list
Then use this script to extract the sequences:
#!/usr/bin/perl use strict; use warnings; #>BTA1.trna335-ValAAC (5010368-5010440) Val (AAC) 73 bp Sc: 47.34 my $current = ''; my %fasta = (); my %annotation =(); open(IN,'<','bos_tau_5_trnascan_se_no_intron.fa') || die "Could not open bos_tau_5_trnascan_se_no_intron.fa: $!\n"; while(<IN>){ chomp; next if (/^$/); if (/^>(.*)$/){ $current = $1; next; } else { $fasta{$current} .= $_; } } close(IN); open(IN,'<','perfect.list') || die "Could not open perfect.list: $!\n"; while(<IN>){ chomp; next if (/^$/); #BTA23.trna1281-IleAAT 111111111 my ($id,$junk) = split(/\t/); if (exists $fasta{$id}){ print ">$id\n$fasta{$id}\n"; } else { die "Missing id $id\n"; } } close(IN);
I came up with 171 tRNA genes (compared with 150 in the paper) using 100% conservation and the unique tRNA gene count was 26 (compared with 22 in the paper). Reasons for the increase are most likely due to improved genome assemblies.
And at the 0.95 threshold, there were 439 cow/bovine tRNA genes.

This work is licensed under a Creative Commons
Attribution 4.0 International License.