Updated 2018 November 7th
I wrote this post back in 2013 with the goal of finding a tool that can output a list of motifs within my sequence of interest. Five years later, I think the easiest tool for performing this task is HOMER. Being the huge Simpsons fan that I am, I'm surprised I didn't check HOMER out back then. In addition, some of the tools I tested back then no longer exist (such as TFSEARCH), which is why I decided to update this post.
To get started, let's get HOMER up and running.
mkdir homer cd homer wget http://homer.ucsd.edu/homer/configureHomer.pl perl configureHomer.pl -install
We want to use findMotifs.pl to finding instances of motifs with a sequence. The usage when using FASTA files is:
# findMotifs.pl <targetSequences.fa> fasta <output directory> -fasta <background.fa> [options] -find motif1.motif > outputfile.txt findMotifs.pl targets.fa fasta motifResults/ -fasta background.fa -find my_motif.txt > my_result.txt
First we will use seq2profile.pl to create a HOMER motif file.
seq2profile.pl GCATAAAAAA > my_motif.txt
Check out the motif file.
cat my_motif.txt >GCATAAAAAA GCATAAAAAA 13.8228985209959 0.001 0.001 0.997 0.001 0.001 0.997 0.001 0.001 0.997 0.001 0.001 0.001 0.001 0.001 0.001 0.997 0.997 0.001 0.001 0.001 0.997 0.001 0.001 0.001 0.997 0.001 0.001 0.001 0.997 0.001 0.001 0.001 0.997 0.001 0.001 0.001 0.997 0.001 0.001 0.001
I've create a FASTA file with the motif embedded between some random nucleotides and I created a random background file by mashing A, C, G, and T on my keyboard.
cat test.fa >test TTTTTTTTTTAAAAAAAAAAGCATAAAAAACCCCCCCCCCGGGGGGGGGG cat bg.fa >background AGTCAGCTAGCTAGCTAGCTACTAGC TAGCTAGCTAGCTAGCTAGCTAGTCG ATCGATCGATTATATCTACGATCATG CTACGTAGCTAGCTAGTCGATCGTAC GTACGTACGTGACTGACGATCGTGCA
Now we can run the findMotifs.pl script.
findMotifs.pl test.fa fasta result_dir/ -fasta bg.fa -find my_motif.txt > my_result.txt
Selected Options:
Input file = test.fa
Promoter Set = fasta
Output Directory = result_dir/
Will use FASTA files for motif finding
Target Sequences = test.fa
Background Sequences = bg.fa
Will find motif(s) in my_motif.txt
Using custom gene IDs for GO analysis
Parsing FASTA format files...
Found 1 sequences
Reading input files...
1 total sequences read
1 motifs loaded
Finding instances of 1 motif(s)
|0% 50% 100%|
=================================================================================
cat my_result.txt
FASTA ID Offset Sequence Motif Name Strand MotifScore
test -5 GCATAAAAAA GCATAAAAAA + 13.832899
The columns are self-explanatory except for the "Offset" and the motif score columns. The offset is the position of the found motif from the transcription start site (TSS) and the TSS is assumed to be the middle of the input FASTA file:
TTTTTTTTTTAAAAAAAAAAGCATA - AAAAACCCCCCCCCCGGGGGGGGGG
If I add another 10 nucleotides to the end of test.fa, then the offset will be -10.
cat test.fa
>test
TTTTTTTTTTAAAAAAAAAAGCATAAAAAACCCCCCCCCCGGGGGGGGGGAAAAAAAAAA
findMotifs.pl test.fa fasta result_dir/ -fasta bg.fa -find my_motif.txt
Selected Options:
Input file = test.fa
Promoter Set = fasta
Output Directory = result_dir/
Will use FASTA files for motif finding
Target Sequences = test.fa
Background Sequences = bg.fa
Will find motif(s) in my_motif.txt
Using custom gene IDs for GO analysis
Parsing FASTA format files...
Found 1 sequences
Reading input files...
1 total sequences read
1 motifs loaded
Finding instances of 1 motif(s)
|0% 50% 100%|
=================================================================================
FASTA ID Offset Sequence Motif Name Strand MotifScore
test -10 GCATAAAAAA GCATAAAAAA + 13.832899
Finally, the motif score column is the log odds score of the motif matrix. To understand how this score is calculated, we need to examine our Position Weight Matrix, which is:
>GCATAAAAAA GCATAAAAAA 13.8228985209959 0.001 0.001 0.997 0.001 0.001 0.997 0.001 0.001 0.997 0.001 0.001 0.001 0.001 0.001 0.001 0.997 0.997 0.001 0.001 0.001 0.997 0.001 0.001 0.001 0.997 0.001 0.001 0.001 0.997 0.001 0.001 0.001 0.997 0.001 0.001 0.001 0.997 0.001 0.001 0.001
The first row is the identifier line, followed by the weights for each nucleotide in the order A, C, G, and T; since our motif is 10 nucleotides long, there are 10 rows. Our sequence is GCATAAAAAA, so we obtain the corresponding weights, which are all 0.997 and divide by 0.25 (the probability of any nucleotide) and calculate the natural logarithm. We multiply this by 10 because our sequence matches all the 0.997 weights in the motif.
log(0.997/0.25) [1] 1.38329 log(0.997/0.25) * 10 [1] 13.8329
The following text is from my 2013 post...
I have a simple task: given a short DNA sequence and I want to know if there are any potential transcription factor binding sites within this sequence. I looked online and found this transcription factor binding site prediction tool called TFSEARCH. It's very straight-forward; all you have to do is input a sequence, which may explain its popularity (based on the site's counter and Google's pagerank for the site).
TFSEARCH
So I decided to test the tool out by inputting a sequence that matches maximally for the Hunchback transcription factor:
GCATAAAAAA
This is the main output of TFSEARCH after some formatting:
TFMATRIX entries with High-scoring:
GCATAAAAAA entry score ---------> M00022 Hb 100.0 <------ M00101 CdxA 92.1 <------ M00100 CdxA 91.0
Total 3 high-scoring sites found.
Max score: 100.0 point, Min score: 91.0 point
Great! It found the Hb transcription factor binding site and gave it a score of 100 as expected. I added some bases that flanked the motif and TFSEARCH could still identify Hb. I added a base in between the motif (not shown), and it still could find the motif, but with a reduced score.
However when I input a sequence that matches maximally for Nanog, GGGCCCATTTCC, I don't get any matches to Nanog; probably because TFSEARCH does not have the position weight matrix for this TF. To work out if TFSEARCH has Nanog in their database I probed it using their CGI script (http://www.cbrc.jp/htbin/bget_tfmatrix?). The TRANSFAC id for Nanog is M01123, so we can search http://www.cbrc.jp/htbin/bget_tfmatrix?M01123 and indeed it is missing. The TRANSFAC id for Hunchback is M00022, and http://www.cbrc.jp/htbin/bget_tfmatrix?M00022 is available.
Unfortunately there's no publication associated to TFSEARCH and not much documentation available on the website, so I can't download the source code and use it with my own set of TRANSFAC binding sites. So I moved along to other online tools that I could find.
PROMO
The next hit returned from Google was PROMO. I searched all species and then proceeded to the "SearchSites" tool and used as input the Hunchback sequence but it was missing from the results file (searchRes.txt). I tried the Nanog motif but it was missing as well. Next!
TFBIND
I came across TFBIND and again I couldn't get Hunchback or Nanog. Next!
ConSite
I found ConSite and used the "Analyze single sequence" feature. Just like TFSEARCH, it could find Hunchback but not Nanog. Unfortunately it's built as a web application so it's not available for download. I scanned the associated publication and ConSite seems to be built using the TFBS Perl module but it seems defunct, since the URL (http://forkhead.cgb.ki.se/TFBS/) is broken. Next!
Match
I tried using the publicly available version of Match but yet again Nanog could not be found.
TESS
I was in despair until I came across TESS, which stands for Transcription Element Search System. It seemed like it was a very popular tool, since it had to be removed from their website. However the authors made the source code available and eagerly I tested it:
wget http://www.cbil.upenn.edu/downloads_orig/TESS/TESS_v1.0.zip unzip TESS_v1.0.zip cd TESS_v1.0 make tessWms
According to the README, the tessWms program requires a PWM file; the format is as such:
>PWM_1 defline 1 2 2 0 2 1 2 0 3 0 1 1 0 5 0 0 5 0 0 0 >PWM_2 defline 0 0 4 1 0 1 4 0 0 0 0 5 0 0 5 0 0 1 2 2 0 2 0 3 1 0 3 1
where the columns are observations for A, C, G, and T in that order. The rows can be either counts, normalized counts, or probabilities.
Update: 2013 October 4th. Previously I had prepared a file as described in the README, however, every second entry would be skipped. I contacted the author of the software, and he told me that a "<" is required to end each entry.
So I prepared the motifs in the TRANSFAC matrix.dat file using the following script, which adds the "<" at the end of each entry:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| #!/bin/env perl | |
| use strict; | |
| use warnings; | |
| my $usage = "Usage: $0 <matrix.dat>\n"; | |
| my $infile = shift or die $usage; | |
| my $accession = ''; | |
| my $start = 0; | |
| my @entry = (); | |
| open(IN, '<', $infile) || die "Could not open $infile: $!\n"; | |
| while(<IN>){ | |
| chomp; | |
| if (/^AC\s+(.*)$/){ | |
| $accession = $1; | |
| } | |
| if (/^NA\s+(.*)$/){ | |
| $accession .= "_$1"; | |
| } | |
| if (/^P0/){ | |
| $start = 1; | |
| $entry[0] = ">$accession"; | |
| next; | |
| } elsif (/^XX/ && $start == 1){ | |
| $start = 0; | |
| foreach my $line (@entry){ | |
| print "$line\n"; | |
| } | |
| print "<\n"; | |
| @entry = (); | |
| } | |
| if ($start == 1){ | |
| my ($junk, $a, $c, $g, $t, $junk2) = split(/\s+/); | |
| $entry[scalar(@entry)] = join("\t", $a, $c, $g, $t); | |
| } | |
| } | |
| close(IN); | |
| exit(0); |
If you run the above script, you can generate a matrix file with the nucleotide frequencies of each motif, which can be used by TESS. Now if I run TESS, voila:
tessWms -mat matrix.tess -seq hb.fa -fmt t Opened sequence file 'hb.fa' Opened output file '-' Opened matrix file 'matrix.tess' Loaded 2201 matrices. #La: (or Score): actual log-odds ratio of the match #Lm: maximum possible log-odds ratio for a match from a given PWM (total information content of the PWM consensus #sequence) #Ld (or Lm - Score): Lm - La #t (tab): #SequenceId MatrixId La HitBegin HitEnd Sense Ld Sequence Oligo Hb M00022 14.213744 1 10 0 0.000000 GCATAAAAAA Hb M00100 6.725922 2 8 1 2.039528 CATAAAA Hb M00101 7.055926 2 8 1 1.068455 CATAAAA Hb M00471 6.633270 2 9 0 6.317433 CATAAAAA Hb M00980 8.376884 1 7 1 2.444785 GCATAAA Hb M00980 7.408779 3 9 1 3.412889 ATAAAAA Hb M01094 11.683704 2 8 0 0.000000 CATAAAA Hb M01097 9.805838 1 10 1 1.635294 GCATAAAAAA Hb M01252 6.109876 3 10 1 6.017624 ATAAAAAA Hb M01292 10.127290 3 8 0 0.000000 ATAAAA Hb M01599 8.138197 3 10 0 3.993047 ATAAAAAA Hb M01654 6.096537 5 10 0 4.710679 AAAAAA Hb M01800 8.512405 5 10 1 0.000000 AAAAAA Hb M01879 6.517766 4 9 1 0.570316 TAAAAA Hb M01879 6.299585 5 10 1 0.788496 AAAAAA Hb M02024 7.923663 1 10 1 6.564609 GCATAAAAAA Hb M02086 9.643571 2 7 1 0.351472 CATAAA Hb M02087 9.347031 2 7 1 0.167457 CATAAA Hb M02355 6.413547 2 9 1 4.743982 CATAAAAA Hb M03564 9.589494 3 8 1 0.000000 ATAAAA Hb M03573 6.982671 3 10 1 6.409391 ATAAAAAA #and now Nanog tessWms -mat matrix.tess -seq nanog.fa -fmt t Opened sequence file 'nanog.fa' Opened output file '-' Opened matrix file 'matrix.tess' Loaded 2201 matrices. Nanog M00313 6.493833 3 10 0 2.808781 GCCCATTT Nanog M00314 6.679016 3 10 0 2.726591 GCCCATTT Nanog M00315 6.349788 3 10 0 2.914699 GCCCATTT Nanog M00640 7.670871 4 11 1 3.518442 CCCATTTC Nanog M00651 6.111954 2 10 0 9.501837 GGCCCATTT Nanog M00704 8.929859 6 11 1 0.530515 CATTTC Nanog M00750 8.028013 6 12 1 2.722466 CATTTCC Nanog M01102 8.183558 6 12 0 1.157541 CATTTCC Nanog M01107 7.336026 3 12 0 2.396582 GCCCATTTCC Nanog M01123 17.852287 1 12 0 0.000000 GGGCCCATTTCC Nanog M01243 6.063891 1 7 1 4.879924 GGGCCCA Nanog M01281 8.626928 7 12 1 2.102362 ATTTCC Nanog M01694 9.504536 5 11 0 1.273142 CCATTTC Nanog M01696 7.053711 4 10 0 5.614710 CCCATTT Nanog M01886 8.683092 7 12 1 1.874469 ATTTCC Nanog M01894 8.822640 5 11 0 1.704351 CCATTTC Nanog M03555 8.999804 7 12 1 1.697730 ATTTCC
Using JASPAR
If you don't have access to the TRANSFAC motifs, you can use an alternative such as JASPAR. Firstly download the set of non-redundant motifs:
wget http://jaspar.genereg.net/html/DOWNLOAD/JASPAR_CORE/pfm/nonredundant/pfm_all.txt head pfm_all.txt >MA0004.1 Arnt 4 19 0 0 0 0 16 0 20 0 0 0 0 1 0 20 0 20 0 0 0 0 20 0 >MA0006.1 Arnt::Ahr 3 0 0 0 0 0 8 0 23 0 0 0 2 23 0 23 0 24 11 1 1 1 24 0 #how many motifs? cat pfm_all.txt | grep ">" | wc -l 593
For this file to work with TESS, we need transpose the JASPAR matrix.
#!/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <infile.jaspar>\n";
my $infile = shift or die $usage;
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
chomp;
#>MA0004.1 Arnt
#4 19 0 0 0 0
#16 0 20 0 0 0
#0 1 0 20 0 20
#0 0 0 0 20 0
if (/^>/){
s/\s+/_/g;
print "$_\n";
my $freq = [];
for(1..4){
chomp(my $line = <IN>);
my @line = split(/\s+/, $line);
push($freq,\@line);
}
for(my $i=0; $i < scalar(@{$freq->[0]}); ++$i){
my $line_to_print = '';
for(my $j=0; $j < scalar(@{$freq}); ++$j){
$line_to_print .= "$freq->[$j][$i]\t";
}
$line_to_print =~ s/\t$//;
print "$line_to_print\n";
}
print "<\n";
}
}
close(IN);
exit(0);
Now transpose the JASPAR matrices and run TESS:
transpose.pl pfm_all.txt > pfm_all.tess head pfm_all.tess #>MA0004.1 Arnt #4 16 0 0 #19 0 1 0 #0 20 0 0 #0 0 20 0 #0 0 0 20 #0 0 20 0 #< #>MA0006.1 Arnt::Ahr #3 8 2 11 #have a look at the test file cat hb.fa #>Hb #GCATAAAAAA #run TESS tessWms -mat pfm_all.tess -seq hb.fa -fmt t Opened sequence file 'hb.fa' Opened output file '-' Opened matrix file 'pfm_all.tess' Loaded 593 matrices. Hb MA0049.1_hb 14.213744 1 10 0 0.000000 GCATAAAAAA Hb MA0151.1_ARID3A 6.890386 3 8 0 3.786596 ATAAAA Hb MA0317.1_HCM1 6.363701 3 10 0 6.108524 ATAAAAAA Hb MA0356.1_PHO2 7.190793 3 8 0 1.732304 ATAAAA
So as expected, the top scoring motif match is hunchback (MA0049.1_hb).
Conclusions
Transcription Element Search System or TESS was the transcription factor binding site prediction tool that did what I wanted: given an input DNA sequence, find all putative binding sites with the log-odds score. Out of all the web tools, TFSEARCH was the simplest to use and worked well, however given the missing number of motifs and the lack of documentation, I had to resort to something else. ConSite had some nice visualisation functions but again it was missing the TF that was of interest to me.
Update: 2013 October 4th. The author of TESS also gave me some very valuable advice regarding the calculation of the background probability and the significance of finding a result when it matches a PWM. I have summarised his sage advice below.
There are a number ways to calculate the background probability and the choices can depend on some fairly subtle distinctions. As far as tessWms goes, it uses a uniform background by default, but you can override this using the -lob option.
However, he does not recommend using a background model and in his opinion:
It is better, I think, to compare the distribution of scores in your test sequences to the distribution of scores in a carefully constructed set of control sequences.
He mentioned that some programs like HOMER have assembled genome-wide sets of promoter regions as a standard control however, this would not account for the GC bias that may be exist in our test set. So it is better to match the overall distribution of base compositions in the test and control sets. This was also some advice I received from a colleague. To conclude:
If you go with adjusting the background model, then in the CpG-poor situation, you could use a uniform or slightly AT-rich background model, but it doesn't really tell you if the matches you are seeing are significant. It just down-weights AT-rich matches and increases the CG-rich matches. Usually a single PWM match is not significant by itself.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
i am not a info guy but i feel relaxed when i read this
Hi, Many thanks for this post. It seems difficult to screen for TF binding sites (particularly such as VDR). I am really new to bioinformatics and biostatistics (and still at the very beginning of the learning curve), but I cannot wait to give it a try. Very interesting blog.
Best,
Vassil
Hi Vassil,
thanks! Here’s a classic paper on the identification of TFBSs -> Applied bioinformatics for the identification of regulatory elements. One of the points of the paper is the “futility theorem”, which is essentially all predicted TFBSs will have no function. This is mostly true for ChIP-seq peaks as well; it’s a noisy technology.
Good luck with the bioinformatics quest!
Cheers,
Dave
Thanks for a useful review of options for mapping TFBSs.
I think the biggest trouble in this area of bioinformatics is having ready access to the necessary position weight matrices. That’s really what’s behind so many algorithms missing the Nanog binding site in your examples. Your solution (to generate your own PWM) is great if you know what TF you are looking for, but for discovery of putative TF binding sites in raw sequence data, we are all stuck finding whatever is available in the tool/algorithm we use, and missing anything else that might be present but not listed in that tool’s underlying data matrices.
I recognize that’s not your problem to solve; just musing on the possibilities and challenges in what you present here.
Very nice post. I was oddly relieved that I’m not the only person struggling with what should be a simple task.
I was wondering if you knew about LASAGNA
http://www.biotechniques.com/BiotechniquesJournal/2013/March/LASAGNA-Search-an-integrated-web-tool-for-transcription-factor-binding-site-search-and-visualization/biotechniques-341032.html
It has a pretty friendly user interface and looks like you can use either established databases (e.g. JASPAR Core) or user defined PWM.
Thanks, I was looking for a tool like this!
Very nice post. And give me lots of useful advise. I’m sorry, but i didnt get if the probabilities of each base should be set as the test sequence or uniform background or control sets?
For TESS, the default is to use a uniform frequency.
But which is better, base compositions of test sequence or the default frequency?