Transcription factor binding site prediction

Updated 2013 December 17th to include JASPAR

I have a simple task: given a short DNA sequence, I want to know whether there are any potential transcription factor binding sites within it. I looked online and found a transcription factor binding site prediction tool called TFSEARCH. It's very straightforward: all you have to do is input a sequence, which may explain its popularity (based on the site's counter and Google's PageRank for the site).

TFSEARCH

So I decided to test the tool by inputting a sequence that is a maximal match for the Hunchback transcription factor:

GCATAAAAAA

This is the main output of TFSEARCH after some formatting:

Continue reading

Position weight matrix

The process of transcription is influenced by proteins called transcription factors (TFs), which bind to specific sites called transcription factor binding sites (TFBSs) that lie proximal or distal to a transcription start site. TFs generally have distinct binding preferences for specific TFBSs; however, they can tolerate variation in the target site. To model a TFBS, the nucleotides at each position are therefore weighted according to the TF's tolerance. One common representation of motifs (in our case TFBSs) in biological sequences is the position weight matrix (PWM), also called a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM).
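To make the idea of position-dependent weights concrete, here is a minimal sketch of scoring a sequence against a PWM. The matrix values are invented for illustration (a toy TATA-like motif, not a real TF), and the function names `score_window`, `scan` and `best_hit` are my own, not part of any tool mentioned here.

```python
# Toy PWM for a hypothetical 4-bp motif: one dict of weights per position.
# The weights are made up for illustration; a real PWM would be derived
# from observed binding sites.
PWM = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": 0.5},  # T is tolerated here
]


def score_window(window, pwm):
    """Sum the weight of the observed base at each motif position."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(sequence, pwm):
    """Score every window of motif width along the sequence."""
    width = len(pwm)
    return [
        (start, score_window(sequence[start:start + width], pwm))
        for start in range(len(sequence) - width + 1)
    ]


def best_hit(sequence, pwm):
    """Return (start, score) of the highest-scoring window."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


if __name__ == "__main__":
    for start, score in scan("GGTATAGG", PWM):
        print(start, score)
```

A window that matches the preferred base at every position gets the maximum possible score, while a tolerated mismatch (the T at the last position) lowers the score only slightly. That is how the matrix encodes the TF's tolerance.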

How do we find TFBSs? DNA sequences that interact with TFs can be determined experimentally using SELEX. Because SELEX involves synthesising a large number of randomly generated oligonucleotides, it reveals both the sequences that a TF binds and how much variation is tolerated at specific positions. From SELEX experiments, a position frequency matrix (PFM) can be constructed by recording the position-dependent frequency of each nucleotide in the DNA sequences that interacted with the TF. Here's an example of a PFM as shown in this review "Applied bioinformatics for the identification of regulatory elements" (sorry, paywall!):
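Separately from the review's figure, the counting step can be sketched in a few lines of Python. The four "bound" sequences below are invented toy data, not SELEX results. The conversion to log-odds weights uses a pseudocount of 1 and a uniform background of 0.25, which is one common convention and an assumption on my part; the helper names `build_pfm` and `pfm_to_pwm` are hypothetical.

```python
import math

BASES = "ACGT"


def build_pfm(sites):
    """Count each nucleotide at each position across equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in BASES}
    for site in sites:
        for pos, base in enumerate(site):
            pfm[base][pos] += 1
    return pfm


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Turn counts into log2 odds against a uniform background.

    The pseudocount keeps unobserved bases from producing log(0).
    """
    n_sites = sum(pfm[base][0] for base in BASES)
    total = n_sites + 4 * pseudocount
    return {
        base: [math.log2(((count + pseudocount) / total) / background)
               for count in counts]
        for base, counts in pfm.items()
    }


if __name__ == "__main__":
    # Invented example sites, for illustration only.
    toy_sites = ["TATAAA", "TATATA", "TATAAG", "CATAAA"]
    pfm = build_pfm(toy_sites)
    for base in BASES:
        print(base, pfm[base])
    pwm = pfm_to_pwm(pfm)
    for base in BASES:
        print(base, [round(w, 2) for w in pwm[base]])
```

Positions where every site agrees get a strongly positive weight for that base and negative weights for the others. Positions with mixed counts get weights closer to zero.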

Continue reading

Transcription factor binding site analysis

Updated 2013 October 4th. Recently I've been looking into transcription factor binding site analyses. With my mind set on this, I thought I'd brush up this old post.

MEME is a tool for discovering motifs in a group of related DNA or protein sequences.

As a discovery tool, it can find motifs de novo. As a somewhat silly test of the software, I wrote a Perl script that inserts a motif at random positions in a set of random sequences.
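The Perl script itself is not shown in this excerpt, so what follows is not that script but a minimal Python sketch of the same idea, under my own assumptions: each random sequence gets exactly one copy of the motif, written over a random position so the sequence length stays fixed, and the result is printed as FASTA (which MEME accepts as input). The function names and the `motif_start=` header field are hypothetical.

```python
import random


def random_dna(length, rng):
    """Generate a uniformly random DNA sequence."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def plant_motif(sequence, motif, rng):
    """Overwrite a random stretch of the sequence with the motif."""
    if len(motif) > len(sequence):
        raise ValueError("motif is longer than the sequence")
    start = rng.randrange(len(sequence) - len(motif) + 1)
    planted = sequence[:start] + motif + sequence[start + len(motif):]
    return planted, start


def make_test_set(n_seqs, length, motif, seed=0):
    """Build (name, motif_start, sequence) records with one planted motif each."""
    rng = random.Random(seed)  # seeded so the test set is reproducible
    records = []
    for i in range(n_seqs):
        seq, start = plant_motif(random_dna(length, rng), motif, rng)
        records.append((f"seq{i}", start, seq))
    return records


def to_fasta(records):
    """Format records as FASTA, noting the planted position in the header."""
    return "\n".join(
        f">{name} motif_start={start}\n{seq}" for name, start, seq in records
    )


if __name__ == "__main__":
    print(to_fasta(make_test_set(n_seqs=10, length=100, motif="GCATAAAAAA")))
```

Recording where the motif was planted makes it easy to check afterwards whether a motif finder recovered the right positions.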

Continue reading