Here's a very primitive way of looking for motifs upstream of RefSeq gene models.
1) Download the upstream sequences (-50) of RefSeq gene models using the UCSC Table Browser tool as a bed file
2) Using the fastaFromBed tool from BEDTools, make fasta files from the bed file
fastaFromBed -s -bed hg19_refgene_upstream_50_080312.bed -fi ~/genome/human/hg19.fa -fo hg19_refgene_upstream_50_080312.fa
3) Look for motifs
Here's the main code and an example to show what it is doing
#!/usr/bin/perl use strict; use warnings; my $seq = 'ACGTACGTACGTACGT'; print "$seq\n"; my $length = length($seq); foreach my $frame (0 .. 5){ my $frame_length = $length - $frame; #print "Frame length: $frame_length\n"; my $mod = $frame_length % 6; #print "Mod: $mod\n"; $frame_length -= $mod; #print "Frame length: $frame_length\n"; for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){ my $mer = substr($seq,$frame,6); my $x = ' ' x $frame; print "$x$mer\n"; } } #ACGTACGTACGTACGT #ACGTAC # GTACGT # CGTACG # TACGTA # GTACGT # ACGTAC # TACGTA # CGTACG # ACGTAC # GTACGT # CGTACG
Now to calculate the motifs from the fasta file
#!/usr/bin/perl use strict; use warnings; my %seq = (); my $current_id = ''; open(IN,'<','hg19_refgene_upstream_50_080312.fa') || die "Could not open hg19_refgene_upstream_50_080312.fa: $!\n"; while(<IN>){ chomp; next if /^$/; if (/^>(.*)$/){ $current_id = $1; } else { my $current_line = $_; $current_line = uc($current_line); $seq{$current_id} .= $current_line; } } close(IN); my %motif = (); foreach my $id (keys %seq){ my $seq = $seq{$id}; my $length = length($seq); foreach my $frame (0 .. 5){ my $frame_length = $length - $frame; my $mod = $frame_length % 6; $frame_length -= $mod; for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){ my $mer = substr($seq,$frame,6); if (exists $motif{$mer}){ $motif{$mer}++; } else { $motif{$mer} = 1; } } } } foreach my $mer (sort {$motif{$b} <=> $motif{$a}} keys %motif){ print "$mer\t$motif{$mer}\n"; } exit(0);
The top ten motifs were:
GGGCGG 7514
GGCGGG 7345
CCGCCC 6706
CCCGCC 6631
GGGGCG 6227
GCGGGG 6153
CGCCCC 5766
CCCCGC 5596
CGGGGC 4822
CCTCCC 4622
Now I'm going to repeat the analysis with only the 10 nucleotides upstream of the RefSeq genes
GGCGGG 985
GGGCGG 965
CCCGCC 867
CCGCCC 838
GCGGGG 774
CGCCCC 732
CCCCGC 709
GGGGCG 705
CGGGGC 661
CCTCCC 640
Almost the same.
See related post on RefSeq promoters.

This work is licensed under a Creative Commons
Attribution 4.0 International License.