ENCODE RNA polymerase II ChIP-seq

Chromatin immunoprecipitation sequencing (ChIP-seq) is a high throughput method for investigating protein-DNA interactions and aims to determine whether specific proteins are interacting with specific genomic loci. The workflow consists of crosslinking DNA and protein together, usually via the use of formaldehyde, which induces protein-DNA and protein-protein crosslinks. Importantly, these crosslinks are reversible by incubation at 70°C. Next the crosslinked DNA-protein complexes are sheared into roughly 500 bp fragments, usually by sonication. At this point we have "sheared DNA" and "sheared DNA crosslinked with proteins". Now comes the immunoprecipitation step, which is a technique that precipitates a protein antigen out of solution using an antibody that recognises that particular antigen. The crosslinking would result in many DNA-protein interactions and we use immunoprecipitation to pull down the protein of interest with the DNA region it was interacting with. After immunoprecipitation, the formaldehyde crosslinks are reversed by heating and the DNA strands are purified and sequenced. There's a nice graphic depicting this workflow at the Wikipedia article for ChIP-seq.

Continue reading

Annotating RNA-Seq data

After mapping your reads from an RNA-Seq experiment, usually the next task is identify the transcripts that the reads came from (i.e. annotating RNA-Seq data) and there are many ways of doing so. Here I just describe a rather crude method whereby I download sequence coordinates of hg19 RefSeqs as a BED12 file from the UCSC Genome Browser Table Browser and extrapolate the 5' UTR, exons, introns, 3' UTR and intergenic regions according to this file. I then output this as a bed file for use with intersectBed. A better (and much less error prone) method would be just to download each individual region of a RefSeq transcript from the Table Browser.

The process would have been quite straight forward, however since transcripts are located on the positive and negative strands and are spliced (even 5' and 3' UTRs), the script I wrote to create the annotation bed file is quite long and may contain unwanted features i.e. bugs. To use the script first download the refGene BED12 file from the UCSC Genome Browser Table Browser. Then sort the file by chromosome and then by the starting coordinate i.e. cat refgene.bed12 | sort -k1,1 -k2,2n > refgene_sorted.bed12. NOTE: this script will only work for the RefSeqs mapped to hg19, since I have hardcoded the chromosome ends within the script.

Without further ado:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;

my %chr_end = ();
#DATA used for calculating final intergenic region
while(<DATA>){
   chomp;
   my ($chr,$end) = split(/\s+/);
   $chr_end{$chr} = $end;
}
close(DATA);

#used for calculating intergenic regions
my $end_of_transcript = '-1';
my $previous_chr = 'chr1';

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
=head1 sample bed file

chr1    11873   14409   NR_046018       0       +       14409   14409   0       3       354,109,1189,   0,739,1347,
chr1    14361   29370   NR_024540       0       -       29370   29370   0       11      468,69,152,159,198,136,137,147,99,154,50,       0,608,1434,2245,2496,2871,3244,3553,3906,10376,14959,
chr1    34610   36081   NR_026818       0       -       36081   36081   0       3       564,205,361,    0,666,1110,

=cut
   my($chr,$start,$end,$id,$score,$strand,$thick_start,$thick_end,$rgb,$block_count,$block_size,$block_start) = split(/\t/);
   #just in case
   $chr= lc $chr;

   #work only with the assembled chromosomes
   next unless exists $chr_end{$chr};

   my @block_size = split(/,/,$block_size);
   my @block_start = split(/,/,$block_start);

   #declare UTR variables
   my $five_prime_utr_start = '0';
   my $five_prime_utr_end = '0';
   my $three_prime_utr_start = '0';
   my $three_prime_utr_end = '0';
   #utr information is present when the thick_start is not equal to the start and end
   if ($end != $thick_start && $thick_start != $start ){
      #chr1    861120  879961  NM_152486       0       +       861321  879533
      my $utr_start = $start;
      my $utr_end = $thick_start + 1;
      if ($strand eq '+'){
         $five_prime_utr_start = $utr_start;
         $five_prime_utr_end = $utr_end;
      } else {
         $three_prime_utr_start =$utr_end;
         $three_prime_utr_end =$utr_start;
      }
   }
   if ($end != $thick_end){
      my $utr_start = $thick_end;
      my $utr_end = $end + 1;
      if ($strand eq '+'){
         $three_prime_utr_start = $utr_start;
         $three_prime_utr_end = $utr_end;
      } else {
         $five_prime_utr_start =$utr_end;
         $five_prime_utr_end =$utr_start;
      }
   }

   #declare arrays for introns and exons
   my @exon_start = ();
   my @exon_end = ();
   my @intron_start = ();
   my @intron_end = ();

   #go through each block from right to left
   for (my $block=$block_count-1; $block>=0; --$block){
      #exons
      #0 based coordinates
      my $exon_start = $start + $block_start[$block] + 1;
      my $exon_end = $start + $block_start[$block] + $block_size[$block];
      #print "Exon: $block\t$exon_start\t$exon_end\n";

      if ($strand eq '+'){
         if ($three_prime_utr_start != 0){
            if ($exon_start > $three_prime_utr_start){
               #print "3' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"3_UTR_$id",'0',$strand),"\n";
            } else {
               #print "3' UTR:\t$three_prime_utr_start\t$exon_end\n";
               print join("\t",$chr,$three_prime_utr_start,$exon_end,"3_UTR_$id",'0',$strand),"\n";
               $three_prime_utr_start = '0';
            }
         }
      } else {
         if ($five_prime_utr_end != 0){
            if ($exon_start > $five_prime_utr_end){
               #print "5' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"5_UTR_$id",'0',$strand),"\n";
            } else {
               #print "5' UTR:\t$five_prime_utr_end\t$exon_end\n";
               print join("\t",$chr,$five_prime_utr_end,$exon_end,"5_UTR_$id",'0',$strand),"\n";
               $five_prime_utr_end = '0';
            }
         }
      }
   }

   my $previous_exon_end = '';
   #go through each block from left to right
   for (1 .. $block_count){
      #array indexing starts with 0
      my $block = $_ - 1;

      #exons
      #0 based coordinates
      my $exon_start = $start + $block_start[$block] + 1;
      my $exon_end = $start + $block_start[$block] + $block_size[$block];
      #print "$id\t$exon_start\t$exon_end\n";
      print join("\t",$chr,$exon_start,$exon_end,"EXON_$id",'0',$strand),"\n";
      $exon_start[$block] = $exon_start;
      $exon_end[$block] = $exon_end;
      #print "Exon $block:\t$exon_start\t$exon_end\n";

      if ($strand eq '+'){
         if ($five_prime_utr_end != 0){
            if ($exon_end < $five_prime_utr_end){
               #print "5' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"5_UTR_$id",'0',$strand),"\n";
            } else {
               #print "5' UTR:\t$exon_start\t$five_prime_utr_end\n";
               print join("\t",$chr,$exon_start,$five_prime_utr_end,"5_UTR_$id",'0',$strand),"\n";
               $five_prime_utr_end = '0';
            }
         }
      } else {
         if ($three_prime_utr_start != 0){
            if ($exon_end < $three_prime_utr_start){
               #print "3' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"3_UTR_$id",'0',$strand),"\n";
            } else {
               #print "3' UTR:\t$exon_start\t$three_prime_utr_start\n";
               print join("\t",$chr,$exon_start,$three_prime_utr_start,"3_UTR_$id",'0',$strand),"\n";
               $three_prime_utr_start = '0';
            }
         }
      }

      #introns
      if ($block == 0){
         $previous_exon_end = $exon_end;
      } else {
         my $intron_start = $previous_exon_end + 1;
         my $intron_end = $exon_start - 1;
         #print "Intron $block:\t$intron_start\t$intron_end\n";
         print join("\t",$chr,$intron_start,$intron_end,"INTRON_$id",'0',$strand),"\n";
         $intron_start[$block] = $intron_start;
         $intron_end[$block] = $intron_end;
         $previous_exon_end = $exon_end;
      }
   }

   #first and last exons
   #uncomment out if you want to store first and last exons
   if ($block_count > 1){
      if ($strand eq '+'){
         my $first_exon_start = $exon_start[0];
         my $first_exon_end = $exon_end[0];
         my $last_exon_start = $exon_start[-1];
         my $last_exon_end = $exon_end[-1];
         #print "First exon:\t$first_exon_start\t$first_exon_end\n";
         #print "Last exon:\t$last_exon_start\t$last_exon_end\n";
      } elsif ($strand eq '-'){
         my $first_exon_start = $exon_start[-1];
         my $first_exon_end = $exon_end[-1];
         my $last_exon_start = $exon_start[0];
         my $last_exon_end = $exon_end[0];
         #print "First exon:\t$first_exon_start\t$first_exon_end\n";
         #print "Last exon:\t$last_exon_start\t$last_exon_end\n";
      } else {
         die "Incorrect strand information: $strand\n";
      }
   }

   #intergenic
   #the file is sorted, so this is the last transcript on the chromosome
   if ($chr ne $previous_chr){
      my $intergenic_start = $end_of_transcript + 1;
      my $intergenic_end = $chr_end{$previous_chr} - 1;
      $end_of_transcript = '-1';
      #print "$intergenic_start\t$intergenic_end\n";
      print join("\t",$chr,$intergenic_start,$intergenic_end,'INTERGENIC','0','0'),"\n";
      $previous_chr = $chr;
   }

   my $intergenic_start = $end_of_transcript + 1;
   my $intergenic_end = $start - 1;
   #print "$intergenic_start\t$intergenic_end\n";
   #this is to avoid overlapping transcript
   if ($intergenic_start >= $intergenic_end){
      #overlapping transcript
   } else {
      print join("\t",$chr,$intergenic_start,$intergenic_end,'INTERGENIC','0','0'),"\n";
   }
   $end_of_transcript = $end;

   #print "---\n\n";

}

close(IN);
exit(0);

#hg19 chromosome ends used for calculating last intergenic region
__DATA__
chr1    249250621
chr2    243199373
chr3    198022430
chr4    191154276
chr5    180915260
chr6    171115067
chr7    159138663
chrX    155270560
chr8    146364022
chr9    141213431
chr10   135534747
chr11   135006516
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392
chr16   90354753
chr17   81195210
chr18   78077248
chr20   63025520
chrY    59373566
chr19   59128983
chr22   51304566
chr21   48129895

You should end up with a bed file as such:

chr1 1264277 1266724 INTERGENIC 0 0
chr1 1284445 1284492 5_UTR_NM_004421 0 -
chr1 1270658 1271895 EXON_NM_004421 0 -
chr1 1270658 1271522 3_UTR_NM_004421 0 -
chr1 1273357 1273563 EXON_NM_004421 0 -
chr1 1271896 1273356 INTRON_NM_004421 0 -
chr1 1273649 1273816 EXON_NM_004421 0 -
chr1 1273564 1273648 INTRON_NM_004421 0 -

Motifs upstream of RefSeq gene models

Here's a very primitive way of looking for motifs upstream of RefSeq gene models.

1) Download the upstream sequences (-50) of RefSeq gene models using the UCSC Table Browser tool as a bed file
2) Using the fastaFromBed tool from BEDTools, make fasta files from the bed file

fastaFromBed -s -bed hg19_refgene_upstream_50_080312.bed -fi ~/genome/human/hg19.fa -fo hg19_refgene_upstream_50_080312.fa

3) Look for motifs

Here's the main code and an example to show what it is doing

#!/usr/bin/perl

use strict;
use warnings;

my $seq = 'ACGTACGTACGTACGT';

print "$seq\n";

my $length = length($seq);
foreach my $frame (0 .. 5){
   my $frame_length = $length - $frame;
   #print "Frame length: $frame_length\n";
   my $mod = $frame_length % 6;
   #print "Mod: $mod\n";
   $frame_length -= $mod;
   #print "Frame length: $frame_length\n";
   for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){
      my $mer = substr($seq,$frame,6);
      my $x = ' ' x $frame;
      print "$x$mer\n";
   }
}
#ACGTACGTACGTACGT
#ACGTAC
#      GTACGT
# CGTACG
#       TACGTA
#  GTACGT
#        ACGTAC
#   TACGTA
#         CGTACG
#    ACGTAC
#          GTACGT
#     CGTACG

Now to calculate the motifs from the fasta file

#!/usr/bin/perl

use strict;
use warnings;

my %seq = ();
my $current_id = '';
open(IN,'<','hg19_refgene_upstream_50_080312.fa') || die "Could not open hg19_refgene_upstream_50_080312.fa: $!\n";
while(<IN>){
   chomp;
   next if /^$/;
   if (/^>(.*)$/){
      $current_id = $1;
   } else {
      my $current_line = $_;
      $current_line = uc($current_line);
      $seq{$current_id} .= $current_line;
   }
}
close(IN);

my %motif = ();

foreach my $id (keys %seq){
   my $seq = $seq{$id};

   my $length = length($seq);
   foreach my $frame (0 .. 5){
      my $frame_length = $length - $frame;
      my $mod = $frame_length % 6;
      $frame_length -= $mod;
      for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){
         my $mer = substr($seq,$frame,6);
         if (exists $motif{$mer}){
            $motif{$mer}++;
         } else {
            $motif{$mer} = 1;
         }
      }
   }
}

foreach my $mer (sort {$motif{$b} <=> $motif{$a}} keys %motif){
   print "$mer\t$motif{$mer}\n";
}

exit(0);

The top ten motifs were:

GGGCGG 7514
GGCGGG 7345
CCGCCC 6706
CCCGCC 6631
GGGGCG 6227
GCGGGG 6153
CGCCCC 5766
CCCCGC 5596
CGGGGC 4822
CCTCCC 4622

Now I'm going to repeat the analysis with only the 10 nucleotides upstream of the RefSeq genes

GGCGGG 985
GGGCGG 965
CCCGCC 867
CCGCCC 838
GCGGGG 774
CGCCCC 732
CCCCGC 709
GGGGCG 705
CGGGGC 661
CCTCCC 640

Almost the same.

See related post on RefSeq promoters.

GC and AT content of 5' UTRs, 3' UTRs and coding exons of RefSeq gene models

Firstly download bed tracks of the 5' UTR, 3' UTR and coding exons from the UCSC Table Browser. The RefSeq gene models are in the table called RefGene.

After you've saved the 3 bed files (e.g. mm9_refgene_090212_5_utr.bed, mm9_refgene_090212_3_utr.bed and mm9_refgene_090212_coding_exon.bed) use the fastaFromBed program from the BEDTools suite and convert the bed file into a fasta file:

fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_5_utr.bed -fo mm9_refgene_090212_5_utr.fa
fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_3_utr.bed -fo mm9_refgene_090212_3_utr.fa
fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_coding_exon.bed -fo mm9_refgene_090212_coding_exon.fa

Then use this Perl script to calculate the nucleotide frequencies:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <fasta.fa>\n";
my $infile = shift or die $usage;

my $seq = '';

open (IN, $infile) or die "Can't open $infile\n";
while (<IN>){
   chomp;
   next if (/^$/);
   if (/^>(.*)$/){
      next;
   }
   else {
      $seq .= $_;
   }
}
close(IN);

my ($a,$c,$g,$t,$other) = parseSeq($seq);

my $total = $a + $c + $g + $t;
my $gc = sprintf("%.2f",($c + $g) * 100 / $total);
my $at = sprintf("%.2f",($a + $t) * 100 / $total);

print "Length = ", length($seq), "\n";
print "A: $a\n";
print "C: $c\n";
print "G: $g\n";
print "T: $t\n";
print "Other: $other\n";

print "GC: $gc\n";
print "AT: $at\n";

sub parseSeq {
   my ($seq) = @_;
   my $a = '0';
   my $c = '0';
   my $g = '0';
   my $t = '0';
   my $other = '0';
   for (my $i = 0; $i < length($seq); ++$i){
      my $base = substr($seq,$i,1);
      if ($base =~ /a/i){
         ++$a;
      } elsif ($base =~ /c/i){
         ++$c;
      } elsif ($base =~ /g/i){
         ++$g;
      } elsif ($base =~ /t/i){
         ++$t;
      } else {
         ++$other;
      }
   }
   return ($a,$c,$g,$t,$other);
}

Running the Perl script on all the fasta files:

for file in `ls *.fa`; do echo $file; tally_nuc.pl $file; done

mm9_refgene_090212_3_utr.fa
Length = 32985433
A: 9238398
C: 7238453
G: 7238331
T: 9270249
Other: 2
GC: 43.89
AT: 56.11
mm9_refgene_090212_5_utr.fa
Length = 7543736
A: 1654667
C: 2115587
G: 2124322
T: 1649160
Other: 0
GC: 56.20
AT: 43.80
mm9_refgene_090212_coding_exon.fa
Length = 44137007
A: 10628969
C: 11450211
G: 11459637
T: 10598190
Other: 0
GC: 51.91
AT: 48.09

In summary, the 5' UTR has the highest GC content (GC: 56.20 and AT: 43.80), the coding exons have no particular bias (GC: 51.91 and AT: 48.09) and the 3' UTR has the lowest GC content (GC: 43.89 and AT: 56.11) of the three categories.

Bidirectional genes

Download 5' UTR for all RefSeq genes using the UCSC Table Browser.

Separate features according to strand

cat hg19_refgene_five_utr_110914.bed | perl -nle '@a = split; print if $a[5] eq "+";' > hg19_refgene_five_utr_110914_plus.bed
cat hg19_refgene_five_utr_110914.bed | perl -nle '@a = split; print if $a[5] eq "-";' > hg19_refgene_five_utr_110914_neg.bed

Use intersectBed to find overlapping features

#Force strandedness as a test, should have no output
intersectBed -s -a hg19_refgene_five_utr_110914_neg.bed -b hg19_refgene_five_utr_110914_plus.bed
intersectBed -wo -a hg19_refgene_five_utr_110914_neg.bed -b hg19_refgene_five_utr_110914_plus.bed > overlap
cat overlap | perl -nle '@a = split; $t{$a[3]} = '1'; $t{$a[9]} = '1'; END {print join("\n",keys %t)};' | cut -f1,2 -d'_' > unique

Performing a GO enrichment analysis on the unique list of bidirectional genes and using all the genes as the universe list:

R version 2.13.0 (2011-04-13)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library("GO.db")
> library("GOstats")
> entrezUniverse=scan("universe2")
> selectedEntrezIds=scan("entrez")
> hgCutoff = 0.001
> #Biological Process
> params = new("GOHyperGParams",geneIds=selectedEntrezIds,universeGeneIds=entrezUniverse,ontology="BP",pvalueCutoff=hgCutoff,conditional=TRUE,testDirection="
over",annotation="org.Hs.eg.db")
> hgOver=hyperGTest(params)
> summary(hgOver)
       GOBPID       Pvalue OddsRatio    ExpCount Count Size
1  GO:0034641 2.845531e-06  1.675853 116.7780905   157 5140
2  GO:0090304 4.877145e-06  1.688666  90.8551719   128 3999
3  GO:0044260 5.050034e-05  1.551424 137.7708834   173 6064
4  GO:0022403 1.093607e-04  2.174322  16.4261789    33  723
5  GO:0016568 2.753699e-04  2.582012   8.3153271    20  366
6  GO:0006996 2.867515e-04  1.923650  22.5838527    40 1037
7  GO:0006281 3.059896e-04  2.631828   7.7473402    19  341
8  GO:0071841 3.514806e-04  1.571534  61.7969662    87 2720
9  GO:0044238 3.720664e-04  1.481915 184.6411559   215 8127
10 GO:0000075 3.912858e-04  3.099874   4.8619672    14  214
11 GO:0051329 4.542075e-04  2.618305   7.3611092    18  324
12 GO:0034622 4.721711e-04  2.353943   9.9965681    22  440
13 GO:0044248 4.733327e-04  1.759037  30.1941794    49 1329
14 GO:0006399 4.960020e-04  3.894081   2.7944952    10  123
15 GO:0006839 4.964491e-04  4.805971   1.8402773     8   81
16 GO:0010564 5.842741e-04  2.558462   7.5201455    18  331
17 GO:0007049 5.963847e-04  1.790438  26.5136248    44 1167
18 GO:0000387 6.138504e-04  8.383672   0.7043037     5   31
19 GO:0006368 7.234803e-04  5.192143   1.4994852     7   66
20 GO:0051276 7.455445e-04  2.080239  13.8588784    27  610
21 GO:0006974 8.086037e-04  3.376597   3.5085746    11  160
22 GO:0050434 8.996583e-04  5.955524   1.1359736     6   50
23 GO:0051028 9.584968e-04  3.873584   2.5218615     9  111
24 GO:0010467 9.885920e-04  1.455638  89.2420894   115 3928
                                                              Term
1                     cellular nitrogen compound metabolic process
2                                   nucleic acid metabolic process
3                         cellular macromolecule metabolic process
4                                                 cell cycle phase
5                                           chromatin modification
6                                           organelle organization
7                                                       DNA repair
8  cellular component organization or biogenesis at cellular level
9                                        primary metabolic process
10                                           cell cycle checkpoint
11                                interphase of mitotic cell cycle
12                        cellular macromolecular complex assembly
13                                      cellular catabolic process
14                                          tRNA metabolic process
15                                         mitochondrial transport
16                                regulation of cell cycle process
17                                                      cell cycle
18                                     spliceosomal snRNP assembly
19        transcription elongation from RNA polymerase II promoter
20                                         chromosome organization
21                                 response to DNA damage stimulus
22                      positive regulation of viral transcription
23                                                  mRNA transport
24                                                 gene expression
> #Molecular Function
> params = new("GOHyperGParams",geneIds=selectedEntrezIds,universeGeneIds=entrezUniverse,ontology="MF",pvalueCutoff=hgCutoff,conditional=TRUE,testDirection="over",annotation="org.Hs.eg.db")
> hgOver=hyperGTest(params)
> summary(hgOver)
      GOMFID       Pvalue OddsRatio    ExpCount Count Size
1 GO:0003676 0.0001312477  1.664829 53.28987613    79 2437
2 GO:0016206 0.0005217889       Inf  0.04574949     2    2
3 GO:0003723 0.0008795796  1.909603 18.43704529    33  806
4 GO:0050662 0.0009527279  3.105508  4.14032903    12  181
                                   Term
1                  nucleic acid binding
2 catechol O-methyltransferase activity
3                           RNA binding
4                      coenzyme binding
> #Cellular component
> params = new("GOHyperGParams",geneIds=selectedEntrezIds,universeGeneIds=entrezUniverse,ontology="CC",pvalueCutoff=hgCutoff,conditional=TRUE,testDirection="over",annotation="org.Hs.eg.db")
> hgOver=hyperGTest(params)
> summary(hgOver)
       GOCCID       Pvalue OddsRatio     ExpCount Count  Size
1  GO:0043227 8.095989e-15  2.357756 193.81366272   266  8649
2  GO:0043229 1.078643e-14  2.445344 215.07960860   285  9598
3  GO:0005622 1.268144e-13  2.727612 259.96442377   320 11601
4  GO:0031974 7.622271e-13  2.476594  50.44219618   102  2251
5  GO:0070013 1.297462e-12  2.479110  48.64949263    99  2171
6  GO:0005634 9.443707e-12  2.053406 120.73858420   183  5388
7  GO:0044422 3.286750e-11  2.013652 121.41084803   182  5418
8  GO:0005654 6.334903e-08  2.336166  28.48157768    59  1271
9  GO:0043228 4.001227e-06  1.784747  58.62140614    92  2616
10 GO:0005730 1.096997e-05  2.682435  11.33884996    28   506
11 GO:0005694 1.166634e-05  2.622132  12.01111380    29   536
12 GO:0043234 5.725214e-05  1.662474  59.30486691    88  2712
13 GO:0000151 6.952506e-05  4.226438   3.11482242    12   139
14 GO:0005739 8.158783e-05  1.881356  29.46756463    51  1315
15 GO:0005737 1.050799e-04  1.488375 182.27313361   218  8134
16 GO:0015630 3.948281e-04  2.107557  14.67776033    29   655
17 GO:0000775 4.285894e-04  3.660054   3.24927519    11   145
18 GO:0005684 5.008308e-04       Inf   0.04481759     2     2
19 GO:0080008 7.364930e-04 11.749319   0.42576709     4    19
                                 Term
1          membrane-bounded organelle
2             intracellular organelle
3                       intracellular
4             membrane-enclosed lumen
5       intracellular organelle lumen
6                             nucleus
7                      organelle part
8                         nucleoplasm
9      non-membrane-bounded organelle
10                          nucleolus
11                         chromosome
12                    protein complex
13           ubiquitin ligase complex
14                      mitochondrion
15                          cytoplasm
16           microtubule cytoskeleton
17     chromosome, centromeric region
18       U2-type spliceosomal complex
19 CUL4 RING ubiquitin ligase complex
> #KEGG
> params = new("KEGGHyperGParams",geneIds=selectedEntrezIds,universeGeneIds=entrezUniverse,pvalueCutoff=hgCutoff,testDirection="over",annotation="org.Hs.eg.db")
> hgOver=hyperGTest(params)
> summary(hgOver)
  KEGGID       Pvalue OddsRatio ExpCount Count Size          Term
1  03013 0.0003025115  3.918123 3.174733    11  152 RNA transport
>

Although this was a brief analysis, the results are somewhat similar to the findings in the paper Trinklein et al., 2004 (An Abundance of Bidirectional Promoters in the Human Genome). Findings from the paper include:

1. DNA-repair genes are more than fivefold overrepresented in the bidirectional class.
2. Chaperone proteins are almost threefold overrepresented
3. Mitochondrial genes are more than twofold overrepresented

RefSeq promoters

Is there any nucleotide bias with the -40 region of RefSeqs?

Taking all hg19 RefSeqs that mapped to assembled chromosomes (36,004) and extracting the nucleotide sequences 40 bp upstream of the RefSeq gene model, I generated a sequence logo.

No obvious TATA box enrichment, which was expected since only 10-20% of genes in eukaryotes have a TATA box (perhaps at -13 to -16?). Note the enrichment of a cytosine at -1.

Then I took the -20 and +20 sequences and generated the same sequence logo plot.

Note the enrichment of purines (adenine and guanine) at the 5' UTR start (position 21).