Transcription factor binding site prediction

Updated 2018 November 7th

I wrote this post back in 2013 with the goal of finding a tool that can output a list of motifs within my sequence of interest. Five years later, I think the easiest tool for performing this task is HOMER. Being the huge Simpsons fan that I am, I’m surprised I didn’t check HOMER out back then. In addition, some of the tools I tested back then no longer exist (such as TFSEARCH), which is why I decided to update this post.

To get started, let’s get HOMER up and running.

mkdir homer
cd homer
wget http://homer.ucsd.edu/homer/configureHomer.pl
perl configureHomer.pl -install

We want to use findMotifs.pl to finding instances of motifs with a sequence. The usage when using FASTA files is:

# findMotifs.pl <targetSequences.fa> fasta <output directory> -fasta <background.fa> [options] -find motif1.motif > outputfile.txt
findMotifs.pl targets.fa fasta motifResults/ -fasta background.fa -find my_motif.txt > my_result.txt

First we will use seq2profile.pl to create a HOMER motif file.

seq2profile.pl GCATAAAAAA > my_motif.txt

Check out the motif file.

cat my_motif.txt 
>GCATAAAAAA     GCATAAAAAA      13.8228985209959
0.001   0.001   0.997   0.001
0.001   0.997   0.001   0.001
0.997   0.001   0.001   0.001
0.001   0.001   0.001   0.997
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001

I’ve create a FASTA file with the motif embedded between some random nucleotides and I created a random background file by mashing A, C, G, and T on my keyboard.

cat test.fa 
>test
TTTTTTTTTTAAAAAAAAAAGCATAAAAAACCCCCCCCCCGGGGGGGGGG

cat bg.fa 
>background
AGTCAGCTAGCTAGCTAGCTACTAGC
TAGCTAGCTAGCTAGCTAGCTAGTCG
ATCGATCGATTATATCTACGATCATG
CTACGTAGCTAGCTAGTCGATCGTAC
GTACGTACGTGACTGACGATCGTGCA

Now we can run the findMotifs.pl script.

findMotifs.pl test.fa fasta result_dir/ -fasta bg.fa -find my_motif.txt > my_result.txt

Selected Options:
        Input file = test.fa
        Promoter Set = fasta
        Output Directory = result_dir/
        Will use FASTA files for motif finding
                Target Sequences = test.fa
                Background Sequences = bg.fa
        Will find motif(s) in my_motif.txt
        Using custom gene IDs for GO analysis
        Parsing FASTA format files...
        Found 1 sequences

        Reading input files...
        1 total sequences read
        1 motifs loaded
        Finding instances of 1 motif(s)
        |0%                                    50%                                  100%|
        =================================================================================

cat my_result.txt 
FASTA ID        Offset  Sequence        Motif Name      Strand  MotifScore
test    -5      GCATAAAAAA      GCATAAAAAA      +       13.832899

The columns are self-explanatory except for the “Offset” and the motif score columns. The offset is the position of the found motif from the transcription start site (TSS) and the TSS is assumed to be the middle of the input FASTA file:

TTTTTTTTTTAAAAAAAAAAGCATA – AAAAACCCCCCCCCCGGGGGGGGGG

If I add another 10 nucleotides to the end of test.fa, then the offset will be -10.

cat test.fa 
>test
TTTTTTTTTTAAAAAAAAAAGCATAAAAAACCCCCCCCCCGGGGGGGGGGAAAAAAAAAA

findMotifs.pl test.fa fasta result_dir/ -fasta bg.fa -find my_motif.txt 

Selected Options:
        Input file = test.fa
        Promoter Set = fasta
        Output Directory = result_dir/
        Will use FASTA files for motif finding
                Target Sequences = test.fa
                Background Sequences = bg.fa
        Will find motif(s) in my_motif.txt
        Using custom gene IDs for GO analysis
        Parsing FASTA format files...
        Found 1 sequences

        Reading input files...
        1 total sequences read
        1 motifs loaded
        Finding instances of 1 motif(s)
        |0%                                    50%                                  100%|
        =================================================================================
FASTA ID        Offset  Sequence        Motif Name      Strand  MotifScore
test    -10     GCATAAAAAA      GCATAAAAAA      +       13.832899

Finally, the motif score column is the log odds score of the motif matrix. To understand how this score is calculated, we need to examine our Position Weight Matrix, which is:

>GCATAAAAAA     GCATAAAAAA      13.8228985209959
0.001   0.001   0.997   0.001
0.001   0.997   0.001   0.001
0.997   0.001   0.001   0.001
0.001   0.001   0.001   0.997
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001

The first row is the identifier line, followed by the weights for each nucleotide in the order A, C, G, and T; since our motif is 10 nucleotides long, there are 10 rows. Our sequence is GCATAAAAAA, so we obtain the corresponding weights, which are all 0.997 and divide by 0.25 (the probability of any nucleotide) and calculate the natural logarithm. We multiply this by 10 because our sequence matches all the 0.997 weights in the motif.

log(0.997/0.25)
[1] 1.38329

log(0.997/0.25) * 10
[1] 13.8329

The following text is from my 2013 post…

I have a simple task: given a short DNA sequence and I want to know if there are any potential transcription factor binding sites within this sequence. I looked online and found this transcription factor binding site prediction tool called TFSEARCH. It’s very straight-forward; all you have to do is input a sequence, which may explain its popularity (based on the site’s counter and Google’s pagerank for the site).

TFSEARCH

So I decided to test the tool out by inputting a sequence that matches maximally for the Hunchback transcription factor:

GCATAAAAAA

This is the main output of TFSEARCH after some formatting:

TFMATRIX entries with High-scoring:

GCATAAAAAA    entry         score
--------->    M00022 Hb     100.0
   <------    M00101 CdxA   92.1
   <------    M00100 CdxA   91.0

Total 3 high-scoring sites found.
Max score: 100.0 point, Min score: 91.0 point

Great! It found the Hb transcription factor binding site and gave it a score of 100 as expected. I added some bases that flanked the motif and TFSEARCH could still identify Hb. I added a base in between the motif (not shown), and it still could find the motif, but with a reduced score.

However when I input a sequence that matches maximally for Nanog, GGGCCCATTTCC, I don’t get any matches to Nanog; probably because TFSEARCH does not have the position weight matrix for this TF. To work out if TFSEARCH has Nanog in their database I probed it using their CGI script (http://www.cbrc.jp/htbin/bget_tfmatrix?). The TRANSFAC id for Nanog is M01123, so we can search http://www.cbrc.jp/htbin/bget_tfmatrix?M01123 and indeed it is missing. The TRANSFAC id for Hunchback is M00022, and http://www.cbrc.jp/htbin/bget_tfmatrix?M00022 is available.

Unfortunately there’s no publication associated to TFSEARCH and not much documentation available on the website, so I can’t download the source code and use it with my own set of TRANSFAC binding sites. So I moved along to other online tools that I could find.

PROMO

The next hit returned from Google was PROMO. I searched all species and then proceeded to the “SearchSites” tool and used as input the Hunchback sequence but it was missing from the results file (searchRes.txt). I tried the Nanog motif but it was missing as well. Next!

TFBIND

I came across TFBIND and again I couldn’t get Hunchback or Nanog. Next!

ConSite

I found ConSite and used the “Analyze single sequence” feature. Just like TFSEARCH, it could find Hunchback but not Nanog. Unfortunately it’s built as a web application so it’s not available for download. I scanned the associated publication and ConSite seems to be built using the TFBS Perl module but it seems defunct, since the URL (http://forkhead.cgb.ki.se/TFBS/) is broken. Next!

Match

I tried using the publicly available version of Match but yet again Nanog could not be found.

TESS

I was in despair until I came across TESS, which stands for Transcription Element Search System. It seemed like it was a very popular tool, since it had to be removed from their website. However the authors made the source code available and eagerly I tested it:

wget http://www.cbil.upenn.edu/downloads_orig/TESS/TESS_v1.0.zip
unzip TESS_v1.0.zip
cd TESS_v1.0
make tessWms

According to the README, the tessWms program requires a PWM file; the format is as such:

>PWM_1 defline
1      2      2      0
2      1      2      0
3      0      1      1
0      5      0      0
5      0      0      0
>PWM_2 defline
0      0      4      1
0      1      4      0
0      0      0      5
0      0      5      0
0      1      2      2
0      2      0      3
1      0      3      1

where the columns are observations for A, C, G, and T in that order. The rows can be either counts, normalized counts, or probabilities.

Update: 2013 October 4th. Previously I had prepared a file as described in the README, however, every second entry would be skipped. I contacted the author of the software, and he told me that a “<” is required to end each entry.

So I prepared the motifs in the TRANSFAC matrix.dat file using the following script, which adds the “<” at the end of each entry:

If you run the above script, you can generate a matrix file with the nucleotide frequencies of each motif, which can be used by TESS. Now if I run TESS, voila:

tessWms -mat matrix.tess -seq hb.fa -fmt t
Opened sequence file 'hb.fa'
Opened output file '-'
Opened matrix file 'matrix.tess'
Loaded 2201 matrices.
#La: (or Score): actual log-odds ratio of the match
#Lm: maximum possible log-odds ratio for a match from a given PWM (total information content of the PWM consensus #sequence)
#Ld (or Lm - Score): Lm - La
#t (tab):
#SequenceId      MatrixId        La      HitBegin        HitEnd  Sense   Ld      Sequence        Oligo
Hb      M00022  14.213744       1       10      0       0.000000        GCATAAAAAA
Hb      M00100  6.725922        2       8       1       2.039528        CATAAAA
Hb      M00101  7.055926        2       8       1       1.068455        CATAAAA
Hb      M00471  6.633270        2       9       0       6.317433        CATAAAAA
Hb      M00980  8.376884        1       7       1       2.444785        GCATAAA
Hb      M00980  7.408779        3       9       1       3.412889        ATAAAAA
Hb      M01094  11.683704       2       8       0       0.000000        CATAAAA
Hb      M01097  9.805838        1       10      1       1.635294        GCATAAAAAA
Hb      M01252  6.109876        3       10      1       6.017624        ATAAAAAA
Hb      M01292  10.127290       3       8       0       0.000000        ATAAAA
Hb      M01599  8.138197        3       10      0       3.993047        ATAAAAAA
Hb      M01654  6.096537        5       10      0       4.710679        AAAAAA
Hb      M01800  8.512405        5       10      1       0.000000        AAAAAA
Hb      M01879  6.517766        4       9       1       0.570316        TAAAAA
Hb      M01879  6.299585        5       10      1       0.788496        AAAAAA
Hb      M02024  7.923663        1       10      1       6.564609        GCATAAAAAA
Hb      M02086  9.643571        2       7       1       0.351472        CATAAA
Hb      M02087  9.347031        2       7       1       0.167457        CATAAA
Hb      M02355  6.413547        2       9       1       4.743982        CATAAAAA
Hb      M03564  9.589494        3       8       1       0.000000        ATAAAA
Hb      M03573  6.982671        3       10      1       6.409391        ATAAAAAA

#and now Nanog
tessWms -mat matrix.tess -seq nanog.fa -fmt t
Opened sequence file 'nanog.fa'
Opened output file '-'
Opened matrix file 'matrix.tess'
Loaded 2201 matrices.
Nanog   M00313  6.493833        3       10      0       2.808781        GCCCATTT
Nanog   M00314  6.679016        3       10      0       2.726591        GCCCATTT
Nanog   M00315  6.349788        3       10      0       2.914699        GCCCATTT
Nanog   M00640  7.670871        4       11      1       3.518442        CCCATTTC
Nanog   M00651  6.111954        2       10      0       9.501837        GGCCCATTT
Nanog   M00704  8.929859        6       11      1       0.530515        CATTTC
Nanog   M00750  8.028013        6       12      1       2.722466        CATTTCC
Nanog   M01102  8.183558        6       12      0       1.157541        CATTTCC
Nanog   M01107  7.336026        3       12      0       2.396582        GCCCATTTCC
Nanog   M01123  17.852287       1       12      0       0.000000        GGGCCCATTTCC
Nanog   M01243  6.063891        1       7       1       4.879924        GGGCCCA
Nanog   M01281  8.626928        7       12      1       2.102362        ATTTCC
Nanog   M01694  9.504536        5       11      0       1.273142        CCATTTC
Nanog   M01696  7.053711        4       10      0       5.614710        CCCATTT
Nanog   M01886  8.683092        7       12      1       1.874469        ATTTCC
Nanog   M01894  8.822640        5       11      0       1.704351        CCATTTC
Nanog   M03555  8.999804        7       12      1       1.697730        ATTTCC

Using JASPAR

If you don’t have access to the TRANSFAC motifs, you can use an alternative such as JASPAR. Firstly download the set of non-redundant motifs:

wget http://jaspar.genereg.net/html/DOWNLOAD/JASPAR_CORE/pfm/nonredundant/pfm_all.txt
head pfm_all.txt
>MA0004.1 Arnt
4       19      0       0       0       0
16      0       20      0       0       0
0       1       0       20      0       20
0       0       0       0       20      0
>MA0006.1 Arnt::Ahr
3       0       0       0       0       0
8       0       23      0       0       0
2       23      0       23      0       24
11      1       1       1       24      0

#how many motifs?
cat pfm_all.txt | grep &quot;&gt;&quot; | wc -l
593

For this file to work with TESS, we need transpose the JASPAR matrix.

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.jaspar>\n";
my $infile = shift or die $usage;

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
#>MA0004.1 Arnt
#4       19      0       0       0       0
#16      0       20      0       0       0
#0       1       0       20      0       20
#0       0       0       0       20      0
   if (/^>/){
      s/\s+/_/g;
      print "$_\n";
      my $freq = [];
      for(1..4){
         chomp(my $line = <IN>);
         my @line = split(/\s+/, $line);
         push($freq,\@line);
      }
      for(my $i=0; $i < scalar(@{$freq->[0]}); ++$i){
         my $line_to_print = '';
         for(my $j=0; $j < scalar(@{$freq}); ++$j){
            $line_to_print .= "$freq->[$j][$i]\t";
         }
         $line_to_print =~ s/\t$//;
         print "$line_to_print\n";
      }
      print "<\n";
   }
}
close(IN);

exit(0);

Now transpose the JASPAR matrices and run TESS:

transpose.pl pfm_all.txt > pfm_all.tess

head pfm_all.tess
#>MA0004.1 Arnt
#4       16      0       0
#19      0       1       0
#0       20      0       0
#0       0       20      0
#0       0       0       20
#0       0       20      0
#<
#>MA0006.1 Arnt::Ahr
#3       8       2       11

#have a look at the test file
cat hb.fa
#>Hb
#GCATAAAAAA

#run TESS
tessWms -mat pfm_all.tess -seq hb.fa -fmt t
Opened sequence file 'hb.fa'
Opened output file '-'
Opened matrix file 'pfm_all.tess'
Loaded 593 matrices.
Hb      MA0049.1_hb     14.213744       1       10      0       0.000000        GCATAAAAAA
Hb      MA0151.1_ARID3A 6.890386        3       8       0       3.786596        ATAAAA
Hb      MA0317.1_HCM1   6.363701        3       10      0       6.108524        ATAAAAAA
Hb      MA0356.1_PHO2   7.190793        3       8       0       1.732304        ATAAAA

So as expected, the top scoring motif match is hunchback (MA0049.1_hb).

Conclusions

Transcription Element Search System or TESS was the transcription factor binding site prediction tool that did what I wanted: given an input DNA sequence, find all putative binding sites with the log-odds score. Out of all the web tools, TFSEARCH was the simplest to use and worked well, however given the missing number of motifs and the lack of documentation, I had to resort to something else. ConSite had some nice visualisation functions but again it was missing the TF that was of interest to me.

Update: 2013 October 4th. The author of TESS also gave me some very valuable advice regarding the calculation of the background probability and the significance of finding a result when it matches a PWM. I have summarised his sage advice below.

There are a number ways to calculate the background probability and the choices can depend on some fairly subtle distinctions. As far as tessWms goes, it uses a uniform background by default, but you can override this using the -lob option.

However, he does not recommend using a background model and in his opinion:

It is better, I think, to compare the distribution of scores in your test sequences to the distribution of scores in a carefully constructed set of control sequences.

He mentioned that some programs like HOMER have assembled genome-wide sets of promoter regions as a standard control however, this would not account for the GC bias that may be exist in our test set. So it is better to match the overall distribution of base compositions in the test and control sets. This was also some advice I received from a colleague. To conclude:

If you go with adjusting the background model, then in the CpG-poor situation, you could use a uniform or slightly AT-rich background model, but it doesn’t really tell you if the matches you are seeing are significant. It just down-weights AT-rich matches and increases the CG-rich matches. Usually a single PWM match is not significant by itself.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
10 comments Add yours
  1. Hi, Many thanks for this post. It seems difficult to screen for TF binding sites (particularly such as VDR). I am really new to bioinformatics and biostatistics (and still at the very beginning of the learning curve), but I cannot wait to give it a try. Very interesting blog.

    Best,
    Vassil

  2. Thanks for a useful review of options for mapping TFBSs.

    I think the biggest trouble in this area of bioinformatics is having ready access to the necessary position weight matrices. That’s really what’s behind so many algorithms missing the Nanog binding site in your examples. Your solution (to generate your own PWM) is great if you know what TF you are looking for, but for discovery of putative TF binding sites in raw sequence data, we are all stuck finding whatever is available in the tool/algorithm we use, and missing anything else that might be present but not listed in that tool’s underlying data matrices.

    I recognize that’s not your problem to solve; just musing on the possibilities and challenges in what you present here.

  3. Very nice post. I was oddly relieved that I’m not the only person struggling with what should be a simple task.

    I was wondering if you knew about LASAGNA
    http://www.biotechniques.com/BiotechniquesJournal/2013/March/LASAGNA-Search-an-integrated-web-tool-for-transcription-factor-binding-site-search-and-visualization/biotechniques-341032.html
    It has a pretty friendly user interface and looks like you can use either established databases (e.g. JASPAR Core) or user defined PWM.

  4. Very nice post. And give me lots of useful advise. I’m sorry, but i didnt get if the probabilities of each base should be set as the test sequence or uniform background or control sets?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.