Firstly download bed tracks of the 5' UTR, 3' UTR and coding exons from the UCSC Table Browser. The RefSeq gene models are in the table called RefGene.
After you've saved the 3 bed files (e.g. mm9_refgene_090212_5_utr.bed, mm9_refgene_090212_3_utr.bed and mm9_refgene_090212_coding_exon.bed) use the fastaFromBed program from the BEDTools suite and convert the bed file into a fasta file:
fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_5_utr.bed -fo mm9_refgene_090212_5_utr.fa fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_3_utr.bed -fo mm9_refgene_090212_3_utr.fa fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_coding_exon.bed -fo mm9_refgene_090212_coding_exon.fa
Then use this Perl script to calculate the nucleotide frequencies:
#!/usr/bin/perl use strict; use warnings; my $usage = "Usage: $0 <fasta.fa>\n"; my $infile = shift or die $usage; my $seq = ''; open (IN, $infile) or die "Can't open $infile\n"; while (<IN>){ chomp; next if (/^$/); if (/^>(.*)$/){ next; } else { $seq .= $_; } } close(IN); my ($a,$c,$g,$t,$other) = parseSeq($seq); my $total = $a + $c + $g + $t; my $gc = sprintf("%.2f",($c + $g) * 100 / $total); my $at = sprintf("%.2f",($a + $t) * 100 / $total); print "Length = ", length($seq), "\n"; print "A: $a\n"; print "C: $c\n"; print "G: $g\n"; print "T: $t\n"; print "Other: $other\n"; print "GC: $gc\n"; print "AT: $at\n"; sub parseSeq { my ($seq) = @_; my $a = '0'; my $c = '0'; my $g = '0'; my $t = '0'; my $other = '0'; for (my $i = 0; $i < length($seq); ++$i){ my $base = substr($seq,$i,1); if ($base =~ /a/i){ ++$a; } elsif ($base =~ /c/i){ ++$c; } elsif ($base =~ /g/i){ ++$g; } elsif ($base =~ /t/i){ ++$t; } else { ++$other; } } return ($a,$c,$g,$t,$other); }
Running the Perl script on all the fasta files:
for file in `ls *.fa`; do echo $file; tally_nuc.pl $file; done
mm9_refgene_090212_3_utr.fa
Length = 32985433
A: 9238398
C: 7238453
G: 7238331
T: 9270249
Other: 2
GC: 43.89
AT: 56.11
mm9_refgene_090212_5_utr.fa
Length = 7543736
A: 1654667
C: 2115587
G: 2124322
T: 1649160
Other: 0
GC: 56.20
AT: 43.80
mm9_refgene_090212_coding_exon.fa
Length = 44137007
A: 10628969
C: 11450211
G: 11459637
T: 10598190
Other: 0
GC: 51.91
AT: 48.09
In summary, the 5' UTR has the highest GC content (GC: 56.20 and AT: 43.80), the coding exons have no particular bias (GC: 51.91 and AT: 48.09) and the 3' UTR has the lowest GC content (GC: 43.89 and AT: 56.11) of the three categories.

This work is licensed under a Creative Commons
Attribution 4.0 International License.