Motifs upstream of RefSeq gene models

Here's a very primitive way of looking for motifs upstream of RefSeq gene models.

1) Download the upstream sequences (-50) of RefSeq gene models using the UCSC Table Browser tool as a bed file
2) Using the fastaFromBed tool from BEDTools, make fasta files from the bed file

fastaFromBed -s -bed hg19_refgene_upstream_50_080312.bed -fi ~/genome/human/hg19.fa -fo hg19_refgene_upstream_50_080312.fa

3) Look for motifs

Here's the main code and an example to show what it is doing

#!/usr/bin/perl

use strict;
use warnings;

my $seq = 'ACGTACGTACGTACGT';

print "$seq\n";

my $length = length($seq);
foreach my $frame (0 .. 5){
   my $frame_length = $length - $frame;
   #print "Frame length: $frame_length\n";
   my $mod = $frame_length % 6;
   #print "Mod: $mod\n";
   $frame_length -= $mod;
   #print "Frame length: $frame_length\n";
   for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){
      my $mer = substr($seq,$frame,6);
      my $x = ' ' x $frame;
      print "$x$mer\n";
   }
}
#ACGTACGTACGTACGT
#ACGTAC
#      GTACGT
# CGTACG
#       TACGTA
#  GTACGT
#        ACGTAC
#   TACGTA
#         CGTACG
#    ACGTAC
#          GTACGT
#     CGTACG

Now to calculate the motifs from the fasta file

#!/usr/bin/perl

use strict;
use warnings;

my %seq = ();
my $current_id = '';
open(IN,'<','hg19_refgene_upstream_50_080312.fa') || die "Could not open hg19_refgene_upstream_50_080312.fa: $!\n";
while(<IN>){
   chomp;
   next if /^$/;
   if (/^>(.*)$/){
      $current_id = $1;
   } else {
      my $current_line = $_;
      $current_line = uc($current_line);
      $seq{$current_id} .= $current_line;
   }
}
close(IN);

my %motif = ();

foreach my $id (keys %seq){
   my $seq = $seq{$id};

   my $length = length($seq);
   foreach my $frame (0 .. 5){
      my $frame_length = $length - $frame;
      my $mod = $frame_length % 6;
      $frame_length -= $mod;
      for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){
         my $mer = substr($seq,$frame,6);
         if (exists $motif{$mer}){
            $motif{$mer}++;
         } else {
            $motif{$mer} = 1;
         }
      }
   }
}

foreach my $mer (sort {$motif{$b} <=> $motif{$a}} keys %motif){
   print "$mer\t$motif{$mer}\n";
}

exit(0);

The top ten motifs were:

GGGCGG 7514
GGCGGG 7345
CCGCCC 6706
CCCGCC 6631
GGGGCG 6227
GCGGGG 6153
CGCCCC 5766
CCCCGC 5596
CGGGGC 4822
CCTCCC 4622

Now I'm going to repeat the analysis with only the 10 nucleotides upstream of the RefSeq genes

GGCGGG 985
GGGCGG 965
CCCGCC 867
CCGCCC 838
GCGGGG 774
CGCCCC 732
CCCCGC 709
GGGGCG 705
CGGGGC 661
CCTCCC 640

Almost the same.

See related post on RefSeq promoters.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.