Motifs upstream of RefSeq gene models

Here's a very primitive way of looking for motifs upstream of RefSeq gene models.

1) Download the upstream sequences (-50) of RefSeq gene models using the UCSC Table Browser tool as a bed file
2) Using the fastaFromBed tool from BEDTools, make fasta files from the bed file

fastaFromBed -s -bed hg19_refgene_upstream_50_080312.bed -fi ~/genome/human/hg19.fa -fo hg19_refgene_upstream_50_080312.fa

3) Look for motifs

Here's the main code and an example to show what it is doing

#!/usr/bin/perl

use strict;
use warnings;

my $seq = 'ACGTACGTACGTACGT';

print "$seq\n";

my $length = length($seq);
foreach my $frame (0 .. 5){
   my $frame_length = $length - $frame;
   #print "Frame length: $frame_length\n";
   my $mod = $frame_length % 6;
   #print "Mod: $mod\n";
   $frame_length -= $mod;
   #print "Frame length: $frame_length\n";
   for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){
      my $mer = substr($seq,$frame,6);
      my $x = ' ' x $frame;
      print "$x$mer\n";
   }
}
#ACGTACGTACGTACGT
#ACGTAC
#      GTACGT
# CGTACG
#       TACGTA
#  GTACGT
#        ACGTAC
#   TACGTA
#         CGTACG
#    ACGTAC
#          GTACGT
#     CGTACG

Now to calculate the motifs from the fasta file

#!/usr/bin/perl

use strict;
use warnings;

my %seq = ();
my $current_id = '';
open(IN,'<','hg19_refgene_upstream_50_080312.fa') || die "Could not open hg19_refgene_upstream_50_080312.fa: $!\n";
while(<IN>){
   chomp;
   next if /^$/;
   if (/^>(.*)$/){
      $current_id = $1;
   } else {
      my $current_line = $_;
      $current_line = uc($current_line);
      $seq{$current_id} .= $current_line;
   }
}
close(IN);

my %motif = ();

foreach my $id (keys %seq){
   my $seq = $seq{$id};

   my $length = length($seq);
   foreach my $frame (0 .. 5){
      my $frame_length = $length - $frame;
      my $mod = $frame_length % 6;
      $frame_length -= $mod;
      for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){
         my $mer = substr($seq,$frame,6);
         if (exists $motif{$mer}){
            $motif{$mer}++;
         } else {
            $motif{$mer} = 1;
         }
      }
   }
}

foreach my $mer (sort {$motif{$b} <=> $motif{$a}} keys %motif){
   print "$mer\t$motif{$mer}\n";
}

exit(0);

The top ten motifs were:

GGGCGG 7514
GGCGGG 7345
CCGCCC 6706
CCCGCC 6631
GGGGCG 6227
GCGGGG 6153
CGCCCC 5766
CCCCGC 5596
CGGGGC 4822
CCTCCC 4622

Now I'm going to repeat the analysis with only the 10 nucleotides upstream of the RefSeq genes

GGCGGG 985
GGGCGG 965
CCCGCC 867
CCGCCC 838
GCGGGG 774
CGCCCC 732
CCCCGC 709
GGGGCG 705
CGGGGC 661
CCTCCC 640

Almost the same.

See related post on RefSeq promoters.

Genome scan for 6mer frequency

Split the genome into 6 bp windows and calculate the 6 mer frequencies.

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage $0 <infile.fa>\n";
my $infile = shift or die $usage;

my $seq = '';
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   next if (/^>/);
   $seq .= $_;
}
close(IN);

my %six_mer = ();
my $length = length($seq);
foreach my $frame (0 .. 5){
   my $frame_length = $length - $frame;
   #print "Frame length: $frame_length\n";
   my $mod = $frame_length % 6;
   #print "Mod: $mod\n";
   $frame_length -= $mod;
   #print "Frame length: $frame_length\n";
   for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){
      my $mer = substr($seq,$frame,6);
      #print "$mer\n";
      if (exists $six_mer{$mer}){
         $six_mer{$mer}++;
      } else {
         $six_mer{$mer} = '1';
      }
   }
}

print "$infile\n";
foreach my $mer (sort {$six_mer{$b} <=> $six_mer{$a} } keys %six_mer){
   print "$mer: $six_mer{$mer}\n";
}
print "\n";

exit(0);

Scanning chr6 of hg19:

NNNNNN: 3719950
aaaaaa: 373380
tttttt: 372667
TTTTTT: 184768
AAAAAA: 182652
aaaaat: 143055
attttt: 142671
ATTTTT: 133646
TTTAAA: 133284
AAAAAT: 133130
TATTTT: 130672
AAAATA: 129572
TTTTAA: 129570
TTAAAA: 129177
aaaata: 123528
tatttt: 123134
atatat: 119872
...