Analysis of the complement and molecular evolution of tRNA genes in cow

Step by step analysis of identifying the bovine tRNA repertoire (using the most recent version of the bovine genome) based on the procedure published in this article. This was my first first-authored paper (published as part of the bovine genome companion papers), and in retrospect there could have been many improvements, however the paper was done on what I knew back then.

1. Download bosTau5.fa.gz from http://hgdownload-test.cse.ucsc.edu/goldenPath/bosTau5/bigZips/ and gunzip

2. Download and install tRNAscan-SE from http://lowelab.ucsc.edu/software/tRNAscan-SE.tar.gz

I had problems installing tRNAscan-SE on Ubuntu 11.04 (64 bit) but found the solution here: http://happykhan.com/?p=545

3. Setup RepeatMasker (instructions can be followed here). I downloaded repeatmaskerlibraries-20110419.tar.gz and RepeatMasker-open-3-3-0.tar.gz.

4. ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.2.25+-x64-linux.tar.gz

5. Once tRNAscan-SE is up, perform the search on bosTau5.fa

./tRNAscan-SE -o bos_tau_5_trnascan_se.output -f bos_tau_5_trnascan_se_trna.output -m bos_tau_5_trnascan_se.log ~/genome/bosTau5.fa

tRNAscan-SE v.1.3 (March 2011) scan results (on host tan118-VirtualBox)
Started: Sat Sep 10 14:33:30 JST 2011

------------------------------------------------------------
Sequence file(s) to search: /home/tan118/genome/bosTau5.fa
Search Mode: Eukaryotic
Results written to: bos_tau_5_trnascan_se.output
Output format: Tabular
Searching with: tRNAscan + EufindtRNA -> Cove
Covariance model: TRNA2-euk.cm
tRNAscan parameters: Strict
EufindtRNA parameters: Relaxed (Int Cutoff= -32.1)
tRNA secondary structure
predictions saved to: bos_tau_5_trnascan_se_trna.output
Search log saved in:
Search statistics saved in: bos_tau_5_trnascan_se.log
------------------------------------------------------------

First-pass (tRNAscan/EufindtRNA) Stats:
---------------
Sequences read: 11894
Seqs w/at least 1 hit: 6798
Bases read: 2918090798 (x2 for both strands)
Bases in tRNAs: 39937512
tRNAs predicted: 519405
Av. tRNA length: 76
Script CPU time: 6153.84 s
Scan CPU time: 1249.34 s
Scan speed: 4671.4 Kbp/sec

First pass search(es) ended: Sat Sep 10 16:41:51 JST 2011

Cove Stats:
-----------
Candidate tRNAs read: 519405
Cove-confirmed tRNAs: 225682
Bases scanned by Cove: 48247536
% seq scanned by Cove: 0.8 %
Script CPU time: 1553.15 s
Cove CPU time: 22435.24 s
Scan speed: 2150.5 bp/sec

Cove analysis of tRNAs ended: Sun Sep 11 08:42:34 JST 2011

Summary
--------
Overall scan speed: 185915.6 bp/sec

tRNAs decoding Standard 20 AA: 31889
Selenocysteine tRNAs (TCA): 3157
Possible suppressor tRNAs (CTA,TTA): 243
tRNAs with undetermined/unknown isotypes: 1143
Predicted pseudogenes: 189250
-------
Total tRNAs: 225682

Isotype / Anticodon Counts:

Ala : 335 AGC: 61 GGC: 54 CGC: 41 TGC: 179
Gly : 8382 ACC: 209 GCC: 2631 CCC: 3431 TCC: 2111
Pro : 37 AGG: 14 GGG: 3 CGG: 7 TGG: 13
Thr : 55 AGT: 25 GGT: 3 CGT: 11 TGT: 16
Val : 258 AAC: 39 GAC: 55 CAC: 99 TAC: 65
Ser : 637 AGA: 41 GGA: 214 CGA: 22 TGA: 29 ACT: 39 GCT: 292
Arg : 932 ACG: 51 GCG: 234 CCG: 57 TCG: 26 CCT: 308 TCT: 256
Leu : 174 AAG: 17 GAG: 25 CAG: 19 TAG: 14 CAA: 63 TAA: 36
Phe : 414 AAA: 48 GAA: 366
Asn : 81 ATT: 13 GTT: 68
Lys : 358 CTT: 110 TTT: 248
Asp : 360 ATC: 37 GTC: 323
Glu : 2493 CTC: 618 TTC: 1875
His : 237 ATG: 36 GTG: 201
Gln : 173 CTG: 104 TTG: 69
Ile : 68 AAT: 28 GAT: 18 TAT: 22
Met : 46 CAT: 46
Tyr : 967 ATA: 179 GTA: 788
Supres: 243 CTA: 162 TTA: 81
Cys : 14133 ACA: 1930 GCA: 12203
Trp : 1749 CCA: 1749
SelCys: 3157 TCA: 3157

6. Make fasta file from tRNAscan-SE output using the script below.

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <trnascan.output> <genome.fa> <output.fa>\n";
my $infile = shift or die $usage;
my $genome = shift or die $usage;
my $outfile = shift or die $usage;

my %genome = ();
my $current_acc = '';
open(IN,'<',$genome) || die "Could not open $genome: $!\n";
TEST: while(<IN>){
   chomp;
   #if you are working with bacterial genomes change the line below to
   #if (/^>([A-Za-z0-9\|\.\_]+)\s+/){
   if (/^>(.*)$/){
      $current_acc = $1;
   } else {
      $genome{$current_acc} .= $_;
   }
   #last TEST unless $current_acc eq 'chr1';
}
close(IN);

warn "Stored genome into memory\n";

my %trna = ();
open (IN, '<', $infile) or die "Could not open $infile: $!\n";
open (OUT,'>', $outfile) or die "Could not open $outfile for writing: $!\n";
while (<IN>){
   chomp;
#Sequence                tRNA            Bounds          tRNA    Anti    Intron Bounds   Cove
#Name            tRNA #  Begin           End             Type    Codon   Begin   End     Score
#--------        ------  ----            ------          ----    -----   -----   ----    ------
#chr1            1       32999           33071           Pseudo  GCA     0       0       20.17
#chr1            2       34208           34280           Pseudo  GCA     0       0       31.02
#chr1            3       34729           34800           Pseudo  GAA     0       0       24.24
#chr1            3924    115232036       115232108       Lys     TTT     0       0       79.55
   next if (/^Sequence/);
   next if (/^Name/);
   next if (/^--/);
   next if (/^$/);
   my $strand = '+';
   my ($acc, $trna_number, $trna_start, $trna_end, $trna_type, $anticodon, $intron_start, $intron_end, $cove_score) = split(/\s+/);

   #tRNAscan likes to add a space to the coordinates randomly
   $trna_start =~ s/\s+//g;
   $trna_end =~ s/\s+//g;
   if ($trna_start > $trna_end){
      $strand = '-';
      ($trna_start, $trna_end) = ($trna_end, $trna_start);
   }

   my $length = $trna_end - $trna_start;
   #substr is 0 based and coordinates in tRNAscan-SE are 1 based
   my $seq = substr($genome{$acc},$trna_start-1,$length+1);
   $seq = uc($seq);
   if ($strand eq '-'){
      $seq =~ tr/ACGT/TGCA/;
      $seq = reverse($seq);
   }
   #my $id = "tRNA_${trna_number}_${acc}_${trna_start}_${trna_end}_${strand}_${trna_type}_${anticodon}_${cove_score}";
   if ($acc =~ /chrUn/){
      $acc =~ s/^chr//;
   }
   my $id = "${acc}_${trna_start}_${trna_end}_${strand}_${trna_type}_${anticodon}_${cove_score}";
   print OUT ">$id\n$seq\n";

}
close(IN);
close(OUT);

exit(0);

7. Run RepeatMasker on the fasta file using these parameters:

./RepeatMasker -s -species cow bos_tau_5_trnascan_se_short_id.fa

8. Some stats from RepeatMasker

cat bos_tau_5_trnascan_se_short_id.fa.out | grep -P "\s+\d+" | awk '{print $5}' | sort -u | wc
 225667  225667 9513704
cat bos_tau_5_trnascan_se_short_id.fa | grep ">" | wc
 225682  225682 9740003
cat bos_tau_5_trnascan_se_short_id.fa.out | grep -P "\s+\d+" | awk '{print $11}' | sort | uniq -c | sort -k1rn | less
 149028 SINE/BovA
  74586 SINE/tRNA-Glu
   2264 tRNA
    126 LTR/ERV1
     19 LINE/L1
     13 Simple_repeat
     12 Low_complexity
     10 SINE/MIR
      2 SINE/tRNA
      2 Unknown
      1 SINE/RTE-BovB
      1 snRNA

Perl script called remove_repeat.pl used to remove putative tRNA entries tagged as something other than tRNA by RepeatMasker

#!/usr/bin/perl

use strict;
use warnings;

my $fa = 'bos_tau_5_trnascan_se_short_id.fa';
my $rm = 'bos_tau_5_trnascan_se_short_id.fa.out';

my %repeat = ();
open(IN,'<',$rm) || die "Could not open $rm: $!\n";
while(<IN>){
   chomp;
   next unless /^\s+\d+/;
   my @line_split = split(/\s+/);
   next if $line_split[11] eq 'tRNA';
   $repeat{$line_split[5]} = '1';
}
close(IN);

open(IN,'<',$fa) || die "Could not open $fa: $!\n";
while(<IN>){
   chomp;
   if (/^>(.*)/){
      my $defline = $1;
      my $seq = <IN>;
      chomp($seq);
      next if (exists $repeat{$defline});
      print ">$defline\n$seq\n";
   } else {
      die "Error in line $.: $_\n";
   }
}
close(IN);

remove_repeat.pl | grep ">" | wc
   2239    2239   93827
remove_repeat.pl | grep ">" | grep -v "Pseudo" | wc
   1257    1257   50982
remove_repeat.pl | grep ">" | grep -v "Pseudo" | grep -v ">Un" | wc
   1121    1121   45380
remove_repeat.pl | grep ">" | grep -v "Pseudo" | grep -v ">Un" | cut -f5 -d'_' | sort | uniq -c | sort -k1rn | less
    253 Glu
    105 Lys
     75 Gly
     66 Val
     63 Ala
     59 Leu
     55 Asp
     46 Ser
     43 Gln
     42 Arg
     35 Asn
     34 Thr
     31 Met
     31 Phe
     30 Ile
     29 Cys
     28 Pro
     24 Tyr
     19 SeC
     17 Undet
     16 His
     13 Trp
      6 Sup
      1 SeC(e)
remove_repeat.pl | grep ">" | grep -v "Pseudo" | grep -v ">Un" | cut -f5,6 -d'_' | sort | uniq -c | sort -k1rn | less
    199 Glu_TTC
     64 Lys_TTT
     54 Glu_CTC
     50 Asp_GTC
     41 Lys_CTT
     37 Ala_AGC
     33 Asn_GTT
     31 Gly_TCC
     31 Met_CAT
     29 Cys_GCA
     29 Phe_GAA
     28 Val_CAC
     26 Gly_GCC
     26 Val_AAC
     23 Gln_CTG
     23 Tyr_GTA
     20 Gln_TTG
     19 Ile_AAT
     19 SeC_TCA
     19 Thr_AGT
     17 Gly_CCC
     17 Ser_GCT
     17 Undet_???
     16 Ala_TGC
     16 His_GTG
     16 Leu_TAA
     14 Leu_CAG
     14 Pro_AGG
     13 Arg_CCT
     13 Leu_AAG
     13 Ser_AGA
     13 Trp_CCA
     11 Leu_CAA
     11 Val_TAC
     10 Arg_TCG
     10 Ile_TAT
     10 Thr_TGT
      9 Pro_TGG
      8 Ala_CGC
      8 Arg_ACG
      7 Arg_TCT
      7 Ser_TGA
      5 Asp_ATC
      5 Leu_TAG
      5 Pro_CGG
      5 Ser_CGA
      5 Sup_TTA
      5 Thr_CGT
      4 Arg_CCG
      4 Ser_ACT
      2 Ala_GGC
      2 Asn_ATT
      2 Phe_AAA
      1 Gly_ACC
      1 Ile_GAT
      1 SeC(e)_TCA
      1 Sup_CTA
      1 Tyr_ATA
      1 Val_GAC

8. Comparison with results from the genomic tRNA database.

tRNA	Tang et al., 2009	Genomic tRNA database
Glu_TTC	199	450
Lys_TTT	64	81
Glu_CTC	54	171
Asp_GTC	50	48
Lys_CTT	41	40
Ala_AGC	37	30
Asn_GTT	33	40
Gly_TCC	31	398
Met_CAT	31	33
Cys_GCA	29	353
Phe_GAA	29	28
Val_CAC	28	45
Gly_GCC	26	62
Val_AAC	26	23
Gln_CTG	23	42
Tyr_GTA	23	38
Gln_TTG	20	18
Ile_AAT	19	17
SeC_TCA	19	25
Thr_AGT	19	14
Gly_CCC	17	1315
Ser_GCT	17	18
Ala_TGC	16	34
His_GTG	16	22
Leu_TAA	16	9
Leu_CAG	14	9
Pro_AGG	14	12
Arg_CCT	13	111
Leu_AAG	13	11
Ser_AGA	13	14
Trp_CCA	13	182
Leu_CAA	11	15
Val_TAC	11	22
Arg_TCG	10	12
Ile_TAT	10	8
Thr_TGT	10	10
Pro_TGG	9	9
Ala_CGC	8	17
Arg_ACG	8	14
Arg_TCT	7	44
Ser_TGA	7	6
Asp_ATC	5	6
Leu_TAG	5	5
Pro_CGG	5	5
Ser_CGA	5	7
Sup_TTA	5	4
Thr_CGT	5	7
Arg_CCG	4	14
Ser_ACT	4	4
Ala_GGC	2	2
Asn_ATT	2	0
Phe_AAA	2	7
Gly_ACC	1	23
Ile_GAT	1	0
SeC(e)_TCA	1	25
Sup_CTA	1	12
Tyr_ATA	1	10
Val_GAC	1	0

Spearman's rho of 0.748.

9. Lastly to find evolutionary conserved tRNAs, download the human, mouse, rat, fugu, chicken, horse, chimpanzee and dog genomes. Following the procedures listed above, run tRNAscan-SE on the respective genomes, produce tRNA fasta files for each respective genome using the Perl script in step 6. Remove intronic sequences detected by tRNAscan-SE using this script:

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <tRNAscanOutput> <input.fa> <output.fa>\n";
my $infile = shift or die $usage;
my $fasta = shift or die $usage;
my $outfile = shift or die $usage;

my $acc = '';
my %seq = ();
open (IN,'<',$fasta) or die "Can't open $fasta: $!\n";
while (<IN>){
   chomp;
   #>tRNA_9_chr1_164648_164720_+_Pseudo_GCA_25.07
   if (/^>(.*)$/){
      $acc = $1;
   }
   elsif (/[AGTCagtc]+/){
      $seq{$acc} .= $_;
   }
   else {
      die "Can't parse line $.: $_\n";
   }
}
close(IN);

my $counter = '0';
my $intron = '0';
my $no_intron = '0';

open (OUT, '>', $outfile) or die "Can't open $outfile for writing: $!\n";
open (IN, '<', $infile) or die "Can't open $infile\n";
while (<IN>){
   chomp;
   next if (/^Sequence/);
   next if (/^Name/);
   next if (/^--/);
   next if (/^$/);
   next if (/Pseudo/);
   ++$counter;
   #Sequence                tRNA            Bounds          tRNA    Anti    Intron Bounds   Cove
   #Name            tRNA #  Begin           End             Type    Codon   Begin   End     Score
   #--------        ------  ----            ------          ----    -----   -----   ----    ------
   #chr1            1       32999           33071           Pseudo  GCA     0       0       20.17
   my ($acc, $trnaNo, $trnaStart, $trnaEnd, $trnaType, $antiCodon, $intronStart, $intronEnd, $coveScore) = split(/\s+/);
   if ($acc =~ /chrUn/){
      $acc =~ s/chr//;
   }
   #tRNAscan likes to add a space to the coordinates randomly
   $trnaStart =~ s/\s+//g;
   $trnaEnd =~ s/\s+//g;
   my $start = $trnaStart;
   my $end = $trnaEnd;
   my $strand = '+';
   if ($trnaStart > $trnaEnd){
      $strand = '-';
      ($trnaStart, $trnaEnd) = ($trnaEnd, $trnaStart);
   }
   if ($intronStart > $intronEnd){
      ($intronStart,$intronEnd) = ($intronEnd,$intronStart);
   }
   #my $id = "tRNA_${trnaNo}_${acc}_${trnaStart}_${trnaEnd}_${strand}_${trnaType}_${antiCodon}_${coveScore}";
   my $id = "${acc}_${trnaStart}_${trnaEnd}_${strand}_${trnaType}_${antiCodon}_${coveScore}";
   #Check if tRNAs are annotated correctly
   unless (exists $seq{$id}){
      die "tRNA $id does not exist in fasta\n";
   }
   my $sequence = $seq{$id};
   my $length = length($sequence);
   if ($intronStart != 0 && $intronEnd != 0){
      ++$intron;
      #Offset starts at 0 for substring
      #Coordinate for end of first part
      my $firstPartCo = $intronStart - $trnaStart;
      #Coordinate for start of second part
      my $secondPartCo = $intronEnd - $trnaStart;
      #Coordinate for end of second part
      my $thirdPartCo = $trnaEnd - $intronEnd;
      #IF intron is on negative strand need to start counting from the other end
      if ($strand eq '-'){
         $sequence = reverse($sequence);
         #Offset starts at 0 for substring
         #Coordinate for end of first part
         my $firstPartCo = $intronStart - $trnaStart;
         #Coordinate for start of second part
         my $secondPartCo = $intronEnd - $trnaStart;
         #Coordinate for end of second part
         my $thirdPartCo = $trnaEnd - $intronEnd;

         my $first = substr($sequence,0,$firstPartCo);
         $first = reverse($first);
         #0 based so add 1 to move 1 base to the right
         my $second = substr($sequence,$secondPartCo+1,$thirdPartCo);
         $second = reverse($second);
         my $tempSeq = $second . $first;
         $length = length ($tempSeq);
         print OUT ">$id\n";
         print OUT "$second$first\n";
      }
      else {
         my $first = substr($sequence,0,$firstPartCo);
         #0 based so add 1 to move 1 base to the right
         my $second = substr($sequence,$secondPartCo+1,$thirdPartCo);
         my $tempSeq = $first . $second;
         $length = length ($tempSeq);
         print OUT ">$id\n";
         print OUT "$first$second\n";
      }
   }
   else {
      ++$no_intron;
      print OUT ">$id\n";
      print OUT "$sequence\n";
   }
}
close (IN);
close (OUT);
warn "Processed $counter tRNAs and there were $intron tRNAs with intronic sequence and $no_intron without\n";

exit(0);

Using each respective fasta file (where there are no intronic sequences), make BLAST databases and finally BLAST the initial cow tRNA fasta set to each respective tRNA database from the different genomes. We will use a sequence identity of >= 95% as a threshold for orthology.

/home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/human/hg19_trnascan_se.output genome/human/hg19.fa hg19_trnascan_se.fa
/home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/mouse/mm9_trnascan_se.output genome/mouse/mm9.fa mm9_trnascan_se.fa
/home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/rat/rn4_trnascan_se.output genome/rat/rn4.fa rn4_trnascan_se.fa
/home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/fugu/fr2_trnascan_se.output genome/fugu/fr2.fa fr2_trnascan_se.fa
/home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/horse/equ_cab_2_trnascan_se.output genome/horse/equ_cab_2.fa equ_cab_2_trnascan_se.fa
/home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/dog/canfam2_trnascan_se.output genome/dog/canfam2.fa canfam2_trnascan_se.fa
/home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/chicken/galgal3_trnascan_se.output genome/chicken/gal_gal_3.fa galgal3_trnascan_se.fa
/home/tan118/work/tRNAscan/trnascan_se_to_fasta.pl genome/chimpanzee/pan_tro_3_trnascan_se.output genome/chimpanzee/panTro3.fa pan_tro_3_trnascan_se.fa

makeblastdb -in hg19_trnascan_se_no_intron.fa -dbtype nucl -title human_trna -out human_trna
makeblastdb -in mm9_trnascan_se_no_intron.fa -dbtype nucl -title mouse_trna -out mouse_trna
makeblastdb -in rn4_trnascan_se_no_intron.fa -dbtype nucl -title rat_trna -out rat_trna
makeblastdb -in fr2_trnascan_se_no_intron.fa -dbtype nucl -title fugu_trna -out fugu_trna
makeblastdb -in equ_cab_2_trnascan_se_no_intron.fa -dbtype nucl -title horse_trna -out horse_trna
makeblastdb -in canfam2_trnascan_se_no_intron.fa -dbtype nucl -title dog_trna -out dog_trna
makeblastdb -in galgal3_trnascan_se_no_intron.fa -dbtype nucl -title chicken_trna -out chicken_trna
makeblastdb -in pan_tro_3_trnascan_se_no_intron.fa -dbtype nucl -title chimpanzee_trna -out chimpanzee_trna

blastn -db chicken_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_chicken_trna_no_intron.blastout
blastn -db mouse_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_mouse_trna_no_intron.blastout
blastn -db horse_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_horse_trna_no_intron.blastout
blastn -db dog_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_dog_trna_no_intron.blastout
blastn -db rat_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_rat_trna_no_intron.blastout
blastn -db human_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_human_trna_no_intron.blastout
blastn -db fugu_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_fugu_trna_no_intron.blastout
blastn -db chimpanzee_trna -outfmt 7 -evalue 1 -query bos_tau_5_trnascan_se_no_intron.fa -out bos_tau_5_trna_to_chimpanzee_trna_no_intron.blastout

To find conserved tRNAs and the corresponding phylogenetic profile, use this script:

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <threshold i.e. 1.00 -> 0.50>\n";
my $trna_fasta = 'bos_tau_5_trnascan_se_no_intron.fa';
my $threshold = shift or die $usage;

my @files = qw/
bos_tau_5_trna_to_chicken_trna_no_intron.blastout
bos_tau_5_trna_to_dog_trna_no_intron.blastout
bos_tau_5_trna_to_fugu_trna_no_intron.blastout
bos_tau_5_trna_to_horse_trna_no_intron.blastout
bos_tau_5_trna_to_human_trna_no_intron.blastout
bos_tau_5_trna_to_mouse_trna_no_intron.blastout
bos_tau_5_trna_to_rat_trna_no_intron.blastout
bos_tau_5_trna_to_chimpanzee_trna_no_intron.blastout
/;

my %phyloProfile = ();
my %tRNAlist = ();
open (IN, '<',"$trna_fasta") or die "Can't open $trna_fasta: $!\n";
while (<IN>){
   chomp;
   if (/^>(.*)/){
      my $trna = $1;
      my $seq = <IN>;
      chomp($seq);
      my $length = length($seq);
      $tRNAlist{$trna} = $length;
   } else {
      die "Could not parse line $.: $_\n";
   }
}
close (IN);

foreach my $currentFile (@files){
   open (IN, '<', $currentFile) or die "Can't open $currentFile: $!\n";
   while (<IN>){
      chomp;
      next if /^#/;
      my ($query, $subject, $id, $aliLen, $mismatch, $gapOpen, $qStart, $qEnd, $sStart, $sEnd, $eValue, $bitScore) = split (/\t/);
      if (exists $tRNAlist{$query}){
         my $percentMatch = $id * $aliLen / '100';
         if ($percentMatch =~ /(\d+)\.(\d+)/){
            my $first = $1;
            my $second = $2;
            if ($second =~ /^9/){
               $percentMatch = $first + 1;
            }
            elsif ($second =~ /^0/){
               $percentMatch = $first;
            }
         }
         if (($percentMatch / $tRNAlist{$query}) >= $threshold){
            $phyloProfile{$query}{$currentFile} = '1';
         }
      }
      else {
         die "Query $query is missing\n";
      }
   }
   close (IN);
}

foreach my $trna (keys %tRNAlist){
   print "$trna\t";
   my $line = '';
   foreach my $files (@files){
      if (exists $phyloProfile{$trna}{$files}){
         $line .= '1';
      }
      else {
         $line .= '0';
      }
   }
   my $letter = substr($line,0,1);
   substr($line,0,1,"1$letter");
   print "$line\n";
}

The output from the above script should look something like this:

tRNA_2015_chrX_56490796_56490867_+_Undet_???_33.67 100000000
tRNA_3466_chr23_30023502_30023573_-_Glu_CTC_76.61 111111111
tRNA_1640_chr24_40048717_40048789_+_Gly_GCC_22.38 100000000

To identify the list of perfectly conserved tRNA genes, the above script with a threshold of 1 (i.e. 100%).

phylogenetic_profile_thresholding.pl 1 | grep 111111111 > perfect.list

Then use this script to extract the sequences:

#!/usr/bin/perl

use strict;
use warnings;

#>BTA1.trna335-ValAAC (5010368-5010440) Val (AAC) 73 bp Sc: 47.34

my $current = '';
my %fasta = ();
my %annotation =();
open(IN,'<','bos_tau_5_trnascan_se_no_intron.fa') || die "Could not open bos_tau_5_trnascan_se_no_intron.fa: $!\n";
while(<IN>){
   chomp;
   next if (/^$/);
   if (/^>(.*)$/){
      $current = $1;
      next;
   } else {
      $fasta{$current} .= $_;
   }
}
close(IN);

open(IN,'<','perfect.list') || die "Could not open perfect.list: $!\n";

while(<IN>){
   chomp;
   next if (/^$/);
   #BTA23.trna1281-IleAAT   111111111
   my ($id,$junk) = split(/\t/);
   if (exists $fasta{$id}){
      print ">$id\n$fasta{$id}\n";
   } else {
      die "Missing id $id\n";
   }
}
close(IN);

I came up with 171 tRNA genes (compared with 150 in the paper) using 100% conservation and the unique tRNA gene count was 26 (compared with 22 in the paper). Reasons for the increase are most likely due to improved genome assemblies.

And at the 0.95 threshold, there were 439 cow/bovine tRNA genes.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

Leave a Reply Cancel reply