Transcription factor binding site analysis

Updated 2013 October 4th. Recently I’ve been looking into transcription factor binding site analyses. With my mind set on this, I thought I’ll brush up this old post.

MEME is a tool for discovering motifs in a group of related DNA or protein sequences.

As a discovery tool, it is able to find de novo motifs. As kind of a silly test for this software, I wrote a Perl script that inserts a motif randomly in a set of random sequences.

#!/bin/env perl

use strict;
use warnings;

#define some parameters
my $motif = 'ATTCCGGA';
my $background_seq = '200';
my $background_seq_len = '400';
my $motif_length = length($motif);
my $limit = $background_seq_len - $motif_length;

#set seed to for reproducibility
srand(1234);

#loop through the number of random sequences to generate
for (1 .. $background_seq){
   my $random_seq = random_seq($background_seq_len);
   #pick a random position in the random sequence
   my $random_position = int(rand($limit));
   #substitute the bases at the random position with the motif
   substr($random_seq,$random_position,$motif_length) = $motif;
   print ">$_\n";
   print "$random_seq\n";
}

#this subroutine takes as input the length of the sequence to generate
#and chooses a nucleotide at random
#and appends them together until it matches the length specified
sub random_seq {
   my ($len) = @_;
   my @nuc = ('A','C','G','T');
   my $seq = '';
   for (1 .. $len){
      my $random_index = int(rand(scalar(@nuc)));
      $seq .= $nuc[$random_index];
   }
   return $seq;
}

Using the above script, we will generate 200 sequences that are 400 bp in length. Within each of these 200 sequences is the motif “ATTCCGGA”, that has been randomly inserted.

Now to set up MEME:

wget http://ebi.edu.au/ftp/software/MEME/4.9.1/meme_4.9.1.tar.gz
tar xzf meme_4.9.1.tar.gz
cd meme_4.9.1
./configure --prefix=$HOME/meme --with-url=http://meme.nbcr.net/meme
make 
make test 
make install

Now to generate the random sequence, with the embedded motif:

random_motif.pl > test.fa
head -4 test.fa
>1
GACCTCAGGCCCCCAGGCTTGGTGTCGGAGAGCGGAGCAAGAAGTCGTAGCGAAACCAAGATGATGCAAGCTAATGAACGGGAATTTGGAAAAGGTTAAAGCTGATAAACGGGTAGGCTACCACCGGGTGTGATAGGGATGGTGGTATCGAGCCGAATATATAGCCTCCGGGTCACCGCGCACCGCATTCCGGATTCAGCTATAAAGATGCCAACCTGGCGCGCTTCGGGGATTACGTATTAGAGACAGCAAGCGACTTGGATCTGGGTAGTACTCGCGCCCCGATATAAACTTTCGCCGCTTTAACTCAGGATCCCATAACCGGGTCAGTCGTAACAATGTGTGTCAGGTTTGCCCCTCTACAAAATAACAAAATCGTTACGTCATTGGGAGGCGACTG
>2
TTTCTGATAGTCGCATCGCGCCAGAGGGTACCCTGGGTATGGAGCGCAGGTGTTGCCCCAACTTATTCTTCCTCTTTTAAGGTCTAGGCTCTTCCTTTGAATCCCATATCTCCGTTAAATCCTTGACGTAGAATAAAACCGCCTAGGATGGCGCTTCGGTATGCTCGACACAATTTTACCTGTGGGGATATAGCATTATCGTTTCATTCGCGGATGCCCTATATGCATCGTCGGCTTACAAGTACCTGTGATTCCCCTTCGGGGCCTTATCCCCTAGGGTGCACCTATTGCGTCGTCGTAGGCTGACAAGTACAAGTCGTTTTACGCTGCCATTGCGGTGCATTCCGGAACGTTCGGCCTGCCCAGCGACCAGAATTCCAAGAGGAGCGTACAGGGTGGA

time meme test.fa
real    4m59.825s
user    4m58.741s
sys     0m0.270s

#showing parts of the results
cat meme_out/meme.txt

********************************************************************************
MOTIF  1        width =    8   sites = 163   llr = 1808   E-value = 1.9e-134
********************************************************************************
         bits   16.3
                14.7
                13.0
                11.4
Relative         9.8
Entropy          8.1
(16.0 bits)      6.5
                 4.9
                 3.3
                 1.6 ********
                 0.0 --------

Multilevel           ATTCCGGA
consensus
sequence

MEME finds the ATTCCGGA motif I inserted:

logo1_eps

Clover

I’d have to use a different tool instead, if I had a library of sequence motifs, such as transcription factor binding sites, and I wanted to find these sites in a set of sequences. Clover is such a program and specifically:

Clover is a program for identifying functional sites in DNA sequences. If you give it a set of DNA sequences that share a common function, it will compare them to a library of sequence motifs (e.g. transcription factor binding patterns), and identify which if any of the motifs are statistically overrepresented in the sequence set.

To set up Clover:

wget http://zlab.bu.edu/~mfrith/downloads/clover-2011-10-24.tar.gz
tar xzf clover-2011-10-24.tar.gz
cd clover-2011-10-24/
make

The usage of Clover is:

Usage summary: clover [options] mymotifs myseqs.fa [BGfiles]

It is recommended to use a background to contrast with the test sequences; they should come from the same taxonomic group and have similar GC compositions. So below is a script that simply takes in a fasta file, counts the nucleotide composition, multiplies the count by a factor and concatenates all the sequences. Finally the concatenated sequence is scrambled and this becomes the background.

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.fa> <factor>\n";
my $infile = shift or die $usage;
my $factor = shift or die $usage;

my $seq = '';
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   next if (/^>/);
   $seq .= $_;
}
close(IN);

my ($a, $c, $g, $t) = (0,0,0,0);

for(my $i=0; $i<length($seq); ++$i){
   my $base = substr($seq,$i,1);
   if ($base =~ /a/i){
      ++$a;
   } elsif ($base =~ /c/i){
      ++$c;
   } elsif ($base =~ /g/i){
      ++$g;
   } elsif ($base =~ /t/i){
      ++$t;
   } else {
      warn "Base $i is not ACGT\n";
   }
}

#print join ("\t",$a,$c,$g,$t),"\n";
$a = $a * $factor;
$c = $c * $factor;
$g = $g * $factor;
$t = $t * $factor;

my $new_seq = 'A' x $a;
$new_seq .= 'C' x $c;
$new_seq .= 'G' x $g;
$new_seq .= 'T' x $t;

$new_seq = scramble($new_seq);
print ">1\n$new_seq\n";

exit(0);

#source http://www.pitt.edu/~mcs2/teaching/perl/code/random.txt
sub scramble{
    my $input = shift;
    my $scramble = '';
    my @input = split(//, $input);
    while(@input){
        my $index = int rand @input;
        $scramble .= $input[$index];
        splice(@input, $index, 1);
    }
    return $scramble;
}

Now to run the script:

make_background.pl test.fa 2 > bg.fa

Now we require the motif file, in this format:

>TATA
0  0 0 10
10 0 0 0
0  0 0 10
10 0 0 0

#download transfac2clover.py
#to convert the TRANSFAC's matrix.dat to Clover's format
wget http://zlab.bu.edu/~mfrith/downloads/transfac2clover.py
python transfac2clover.py ~/transfac/dat/matrix.dat > converted_transfac_matrices

Now open up the converted_transfac_matrices file and manually enter random motif we introduced into the random sequence file:

>ATTCCGGA
31      0       0       0
0       0       0       31
0       0       0       31
0       31      0       0
0       31      0       0
0       0       31      0
0       0       31      0
31      0       0       0

And now to run Clover:

#to save time just test the four motifs
cat converted_transfac_matrices | head -49 > test_motif
time clover test_motif test.fa bg.fa > test_clover.out
cat test_clover.out | head -20
Clover: Cis-eLement OVERrepresentation
Compiled on Oct  4 2013

Sequence file: test.fa (200 sequences, 80000 bp, 49.7% C+G)
Motif file: test_motif (4 motifs)

*** Over- and Under-represented Motifs:

Motif     Raw score  P-value from randomizing
                     bg.fa
ATTCCGGA  inf        0

*** Motif Instances with Score >= 6:

Motif     Location   Strand  Sequence  Score

>1
ATTCCGGA  187 - 194    +     attccgga  10.9
ATTCCGGA  189 - 196    -     tccggatt  6.67

Clover was able to find the ATTCCGGA, however I only tested 4 motifs. To test my full collection of motifs in a reasonable time, I split the motifs into 16 files using this script:

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <motif> <divisions>\n";
my $infile = shift or die $usage;
my $div = shift or die $usage;

#count the number of entries
chomp(my $no_of_entry = `cat $infile | grep "^>" | wc -l`);

#divide the number of entries
#round this number and add 1
my $threshold = sprintf("%d", $no_of_entry / $div) + 1;

my %store = ();
my $current_id = '';

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   next if (/^$/);
   next if (/^#/);
   if (/^>/){
      $current_id = $_;
      next;
   }
   $store{$current_id} .= "$_\n";
}
close(IN);

my $counter = 0;
my $split_counter = 0;
my $outfile = $infile . "_$split_counter";
open(OUT,'>',$outfile) || die "Could not open $outfile for writing: $!\n";

foreach my $id (keys %store){
   ++$counter;
   print OUT "$id\n$store{$id}";
   if ($counter == $threshold){
      close(OUT);
      ++$split_counter;
      $outfile = $infile . "_$split_counter";
      open(OUT,'>',$outfile) || die "Could not open $outfile for writing: $!\n";
      $counter = 0;
   }
}
close(OUT);

exit(0);

Now to split the motifs and to run Clover in parallel:

split_motif.pl converted_transfac_matrices 16
parallel clover converted_transfac_matrices_{} test.fa bg.fa '>' test_clover_{}.out ::: {0..15}

Using ENCODE ChIP-Seq data

Let’s test Clover with some ChIP-Seq data; I will use ChIP-Seq data from ENCODE on a K562 cell line for the c-Fos transcription factor. My expectation is that ChIP-Seq peaks should be enriched with the known transcription factor binding site of c-Fos.

Firstly download the file:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs/wgEncodeSydhTfbsK562CfosStdPk.narrowPeak.gz

The file format is explained here. I will just take the top 20 peaks with the highest signal value [measurement of overall (usually, average) enrichment for the region] and convert it into a fasta file.

zcat ./wgEncodeSydhTfbsK562CfosStdPk.narrowPeak.gz | sort -k7rn | head -100 > k562_cfos.bed
fastaFromBed -fi ~/genome/hg19.fa -bed k562_cfos.bed -fo k562_cfos.fa
cat k562_cfos.fa | head -2
>chr17:44439034-44439404
GACTCACCTGTTGCCACTCAAGGTACCGCTACCTACAACACCGCCGACTCCGCGGCCTTTAGGATTTCAGCTCAGTTCAGCTAAACCACGACAGGCGTGGGGGCAGGAACAGCAACCAACCAATCACCCACCGCCTCCTGGAGCTTTTGCACCAATGAGCTCGAAGTTTTGTGAGTGACGACATATCTGGCCAATGCAAAGAAGAGTAAGGGCCTGGGAGGGAGGGAGCGCCGTAGGCGACGCCATGAAGCTCTGGCAATACCATGTTCATCTTCAAATCACAGTTAAACGAATTCTGGCGAGACACGCCCACGTCCCCCACCCCCGATGCCATGTGCGACCAATCAGAAAAGCAAAAGGATTGTCTATT

Now running Clover using human chr20 as the background:

wget http://zlab.bu.edu/~mfrith/downloads/hs_chr20.mfa.gz
gunzip hs_chr20.mfa.gz
time clover converted_transfac_matrices k562_cfos.fa hs_chr20.mfa > k562_cfos_clover.out
real    89m30.108s
user    89m27.402s
sys     0m1.468s
cat k562_cfos_clover.out
Clover: Cis-eLement OVERrepresentation
Compiled on Oct  4 2013

Sequence file: k562_cfos.fa (100 sequences, 50774 bp, 60.6% C+G)
Motif file: converted_transfac_matrices (2202 motifs)

*** Over- and Under-represented Motifs:

Motif                                              Raw score  P-value from randomizing
                                                              hs_chr20.m
M00287 V$NFY_01 NF-Y                               326        0
M02106 V$NFYA_Q5 NF-YA                             308        0
M00775 V$NFY_Q6_01 NF-Y                            251        0
M00185 V$NFY_Q6 NF-Y                               221        0
M00254 V$CAAT_01 CCAAT box                         216        0
M00288 F$HAP234_01 HAP2/3/4                        204        0
M02434 F$HAP4_01 HAP4                              194        0
M02107 V$NFYC_Q5 NF-YC                             180        0
M00209 V$NFY_C NF-Y                                164        0
M00687 V$ALPHACP1_01 alpha-CP1                     151        0
__snipped__

The top four hits correspond to NFYA. Here’s the sequence logo constructed using R for M00287:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("seqLogo")
 
library(seqLogo)
a <- c(52,22,16,80,68,0,2,160,164,0,20,96,21,58,47,34)
c <- c(47,51,67,19,7,164,161,0,0,1,88,10,28,57,11,53)
g <- c(41,38,40,60,78,0,1,1,0,0,50,55,98,33,72,40)
t <- c(22,52,41,5,11,0,0,3,0,163,6,3,17,16,34,35)
df <- data.frame(a,c,g,t)
df
     a   c  g   t
1   52  47 41  22
2   22  51 38  52
3   16  67 40  41
4   80  19 60   5
5   68   7 78  11
6    0 164  0   0
7    2 161  1   0
8  160   0  1   3
9  164   0  0   0
10   0   1  0 163
11  20  88 50   6
12  96  10 55   3
13  21  28 98  17
14  58  57 33  16
15  47  11 72  34
16  34  53 40  35

#define function that divides the frequency by the row sum i.e. proportions
proportion <- function(x){
   rs <- sum(x);
   return(x / rs);
}
 
#create position weight matrix
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
seqLogo(pwm)

M00287

Here’s what the Fos binding site looks like:

MA0099.1_BIG

Not quite sure on the results, but the CCAAT motif seems to be enriched despite being quite different from the Fos motif.

Let’s try again with a Stat3 ChIP-Seq dataset on the Gm12878 cell line (note I don’t sort by score because the peak file is already sorted by significance):

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs/wgEncodeSydhTfbsGm12878Stat3IggmusPk.narrowPeak.gz
zcat wgEncodeSydhTfbsGm12878Stat3IggmusPk.narrowPeak.gz | head -100 > gm12878_stat3.bed
fastaFromBed -fi ~/genome/hg19.fa -bed gm12878_stat3.bed -fo gm12878_stat3.fa
time clover converted_transfac_matrices gm12878_stat3.fa hs_chr20.mfa > gm12878_stat3_clover.out
real    276m42.550s
user    276m38.408s
sys     0m0.492s
cat gm12878_stat3_clover.out

lover: Cis-eLement OVERrepresentation
Compiled on Oct  4 2013

Sequence file: gm12878_stat3.fa (100 sequences, 160944 bp, 44.5% C+G)
Motif file: converted_transfac_matrices (2202 motifs)

*** Over- and Under-represented Motifs:

Motif                                              Raw score  P-value from randomizing
                                                              hs_chr20.m
M00972 V$IRF_Q6_01 IRF                             105        0
M01172 V$PU1_Q4 PU.1                               96         0.002
M00772 V$IRF_Q6 IRF                                93.3       0
M01881 V$IRF1_Q6_01 IRF-1                          88.3       0
M01882 V$IRF2_Q6 IRF-2                             73.3       0
M02916 V$SRF_06 Srf                                59.8       0.998
M01718 V$NFAT2_Q5 NF-AT2                           56.8       0
M00500 V$STAT6_02 STAT6                            56.5       0
__snipped__

Firstly how does the STAT3 sequence logo look?

data <- scan()
16.1    20.3    30.7    32.8
2.4     18      59.1    20.3
9.8     17.5    39.4    33.3
5       45.2    27.9    21.9
77.9    0.8     21.3    0
12.6    6.8     0       80.6
0       0       0       100
0       0       0       100
0       100     0       0
0       100     0       0
0       50      50      0
0       0       100     0
0       0       98.2    1.8
100     0       0       0
100     0       0       0
63.3    0       32.7    4
1.8     13.8    1.8     82.6
7.7     19.3    70      3
34.7    18.7    28      18.7
27.3    18.2    22.4    32.1
19.6    24.3    16      40.1
library(seqLogo)
Loading required package: grid
proportion <- function(x){
    rs <- sum(x);
    return(x / rs);
}
df <- matrix(data=data, ncol=4, byrow=T)
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
seqLogo(pwm)

stat3

And IRF?

irf

And PU.1?

pu.1

It’s known that PU.1 interacts with the Interferon Regulatory Factor (IRF). STAT3 is activated through phosphorylation of tyrosine 705, in response to various cytokines and growth factors including interferons. It’s interesting that using the top 100 most significant peaks from the Stat3 ChIP-Seq experiment, would return transcription factor binding sites for PU.1 and IRF.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
7 comments Add yours
  1. What does that mean when the raw score is equal to inf? I have some result like this and I don’t know really how to interpret them. Thank you very much for your help!

    1. Hi nanac,

      This is directly from Clover’s manual:

      Clover will compare each motif in turn to the sequence set, and calculate a “raw score” indicating how strongly the motif is present in the sequence set. Raw scores by themselves are hard to interpret, so Clover provides options (which we recommend you use) to determine the statistical significance of the raw scores.

      So in short, use the p-values. I’m sure if you’re really interested, you can email Martin Frith (the author of Clover). Just look him up on the web.

      Cheers,

      Dave

  2. Thanks for your answer, then I just focus on the p-value. But just a little other question, actually about this p-value.
    If it’s equal to 0?? I’m not sure if I have to consider that this motif is really over-represented or that it is an incorrect value and I don’t have to take into account this motifs…
    Thank you in advance!

    Cheers

    1. Hi nanac,

      My interpretation of the p-values that equal 0 is just that Clover probably “rounds” the numbers e.g. 0.00001 is rounded into 0. So they are statistically significant and should be considered.

      From the manual again:

      Clover prints details for statistically significant motifs (all P-values <= some threshold, by default 0.01), and then finds instances of these motifs in the sequences. So your output already contains the list of motifs that are statistically significant at a p-value <= 0.01, assuming you used the default setting. Hope that helps! Dave

  3. Hi Dave,
    I have a bunch of sequence, upstream [-35 -> -20] of TSSs and they are enriched for TATA-like(noncanonical TATA) motif, as will be clearly seen in weblogo.

    Can you suggest me how to get %similarities of these individual sequences with respect to the canonical TATA motif from Jaspar, so that i could plot the hist of %similarities. How many of these sequences have canonical TATA, ie 100% match with JASPAR matrix.

    thanks in advance !!

    This is the seq, which you paste at
    http://weblogo.threeplusone.com/
    to see the enriched logo

    >Seq1
    GTTGGGAAGCGA
    >Seq2
    TTTTTAGGGTAA
    >Seq3
    TTAGAGAAGAGG
    >Seq4
    TTTTTTTTTTCT
    >Seq5
    CGTATAAAGACC
    >Seq6
    CTTTACGAACGA
    >Seq7
    CGTTTCACAGTC
    >Seq8
    GTCTTGGTAATG
    >Seq9
    GCTATAAATAGG
    >Seq10
    ACTATAAATCCA
    >Seq11
    GACATTTCCCAG
    >Seq12
    GGCCAATGAGTG
    >Seq13
    GCGTCATTATAG
    >Seq14
    CGTCATATCCTC
    >Seq15
    TGTTATAGAACC
    >Seq16
    CCTGATAGGTCG
    >Seq17
    TACAAGAAGCAC
    >Seq18
    GGCATAAATAGG
    >Seq19
    CGTCATCTAGAA
    >Seq20
    TAGGTACACGAA
    >Seq21
    ATATATAAGCCC
    >Seq22
    CTGTTTTGTTTT
    >Seq23
    GATGTATATGAG
    >Seq24
    TAAATACCTCTA
    >Seq25
    TGGGACTTTTGC
    >Seq26
    AATGACGTCATC
    >Seq27
    GAGAAATGGCGT
    >Seq28
    TCCAACAAAGCT
    >Seq29
    TAGCCGGTAGAG
    >Seq30
    TTTGTTTTTGTC
    >Seq31
    AGTCTTTAAAGG
    >Seq32
    TGGTTTTAGAGC
    >Seq33
    ATCTTGAATAGT
    >Seq34
    ATAAAAGAGCAA
    >Seq35
    AGAGTAATCGAT
    >Seq36
    AGGATTTGTCAC
    >Seq37
    CGTTGTCATAGT
    >Seq38
    TTATATAGTTTA
    >Seq39
    TGCACAGTCACA
    >Seq40
    AACTATATAAAA
    >Seq41
    TTTCAAGATGTC
    >Seq42
    TGTTTTTGTTAG
    >Seq43
    ACCTCATTTTGA
    >Seq44
    CTTAATTTGAAG
    >Seq45
    AACATGTTAAAG
    >Seq46
    TCCTCTTATCAA
    >Seq47
    TAAAAGTGAAAC
    >Seq48
    CGTAATCAACAC
    >Seq49
    CCATAGTAACAT
    >Seq50
    AGCGTCGTCGAG
    >Seq51
    GTTTTAACTTCA
    >Seq52
    TAGCTTGTGACA
    >Seq53
    AGGAAACAAGCA
    >Seq54
    ATCTCATAGGCT
    >Seq55
    TACTTCAACAGC
    >Seq56
    CGAAGGGACCTG
    >Seq57
    AGTGATATTATC
    >Seq58
    ACGGAAGTAGTG
    >Seq59
    ACGTATATAGCG
    >Seq60
    AGTTTATTTAGA
    >Seq61
    AGACACACCTCC
    >Seq62
    ATTATTAGAATA
    >Seq63
    ACTGACCTCTTC
    >Seq64
    TTTACATAACTC
    >Seq65
    TGCTCTGATTGG
    >Seq66
    ATCGCATTAGTA
    >Seq67
    ACAACACAACAG
    >Seq68
    CGAATAAATGAC

    1. Hi Chirag,

      first question is why are your sequences 12 bp long when you said you took [-35 -> -20]? One crude way of doing what you asked is to take the JASPAR PFM for TBP (MA0108.2 and MA0386.1), convert them into PWMs as per this post http://davetang.org/muse/2013/10/01/position-weight-matrix/ and simply take the sum. However this requires the sequence to be the same length and does not perform any alignment between the input and PWM.

      I also tried to run TESS on your sequences to the JASPAR PFMs. Even when I lowered the threshold for outputting results (-mlo 0), I couldn’t see any results for TBP. However, if I use the TRANSFAC PFMs, 54 of your 68 sequences have a log-odds ratio greater than one. For example Seq9 matches maximally to M00471.

      I am definitely no expert at this; perhaps Albin could help you out on this.

      Good luck!

      Dave

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.