GC and AT content of 5′ UTRs, 3′ UTRs and coding exons of RefSeq gene models

Firstly download bed tracks of the 5' UTR, 3' UTR and coding exons from the UCSC Table Browser. The RefSeq gene models are in the table called RefGene.

After you've saved the 3 bed files (e.g. mm9_refgene_090212_5_utr.bed, mm9_refgene_090212_3_utr.bed and mm9_refgene_090212_coding_exon.bed) use the fastaFromBed program from the BEDTools suite and convert the bed file into a fasta file:

fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_5_utr.bed -fo mm9_refgene_090212_5_utr.fa
fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_3_utr.bed -fo mm9_refgene_090212_3_utr.fa
fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_coding_exon.bed -fo mm9_refgene_090212_coding_exon.fa

Then use this Perl script to calculate the nucleotide frequencies:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <fasta.fa>\n";
my $infile = shift or die $usage;

my $seq = '';

open (IN, $infile) or die "Can't open $infile\n";
while (<IN>){
   chomp;
   next if (/^$/);
   if (/^>(.*)$/){
      next;
   }
   else {
      $seq .= $_;
   }
}
close(IN);

my ($a,$c,$g,$t,$other) = parseSeq($seq);

my $total = $a + $c + $g + $t;
my $gc = sprintf("%.2f",($c + $g) * 100 / $total);
my $at = sprintf("%.2f",($a + $t) * 100 / $total);

print "Length = ", length($seq), "\n";
print "A: $a\n";
print "C: $c\n";
print "G: $g\n";
print "T: $t\n";
print "Other: $other\n";

print "GC: $gc\n";
print "AT: $at\n";

sub parseSeq {
   my ($seq) = @_;
   my $a = '0';
   my $c = '0';
   my $g = '0';
   my $t = '0';
   my $other = '0';
   for (my $i = 0; $i < length($seq); ++$i){
      my $base = substr($seq,$i,1);
      if ($base =~ /a/i){
         ++$a;
      } elsif ($base =~ /c/i){
         ++$c;
      } elsif ($base =~ /g/i){
         ++$g;
      } elsif ($base =~ /t/i){
         ++$t;
      } else {
         ++$other;
      }
   }
   return ($a,$c,$g,$t,$other);
}

Running the Perl script on all the fasta files:

for file in `ls *.fa`; do echo $file; tally_nuc.pl $file; done

mm9_refgene_090212_3_utr.fa
Length = 32985433
A: 9238398
C: 7238453
G: 7238331
T: 9270249
Other: 2
GC: 43.89
AT: 56.11
mm9_refgene_090212_5_utr.fa
Length = 7543736
A: 1654667
C: 2115587
G: 2124322
T: 1649160
Other: 0
GC: 56.20
AT: 43.80
mm9_refgene_090212_coding_exon.fa
Length = 44137007
A: 10628969
C: 11450211
G: 11459637
T: 10598190
Other: 0
GC: 51.91
AT: 48.09

In summary, the 5' UTR has the highest GC content (GC: 56.20 and AT: 43.80), the coding exons have no particular bias (GC: 51.91 and AT: 48.09) and the 3' UTR has the lowest GC content (GC: 43.89 and AT: 56.11) of the three categories.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.