I have written about sequence conservation in vertebrates previously, but without much elaboration, hence I'm writing another post on this topic. The assumption behind sequence conservation is that conserved regions are under purifying selection, i.e. alleles that decrease the fitness of an organism are removed, and therefore that these regions probably do something important. Protein-coding regions are typically well conserved among the genomes of different species, so it's widely accepted that they are useful. Sequences need to be aligned in order to infer sequence conservation and, conveniently, a multiple sequence alignment (MSA) of 46 vertebrate genomes is provided at the UCSC Genome Browser site.
Updated 2013 December 17th to include JASPAR
I have a simple task: given a short DNA sequence, I want to know if there are any potential transcription factor binding sites within it. I looked online and found a transcription factor binding site prediction tool called TFSEARCH. It's very straightforward; all you have to do is input a sequence, which may explain its popularity (based on the site's counter and Google's PageRank for the site).
So I decided to test the tool out by inputting a sequence that matches maximally for the Hunchback transcription factor:
This is the main output of TFSEARCH after some formatting:
The process of transcription is influenced by proteins called transcription factors (TFs), which bind to specific sites called transcription factor binding sites (TFBSs) that are proximal or distal to a transcription start site. TFs generally have distinct binding preferences towards specific TFBSs; however, TFs can tolerate variations in the target TFBS. Thus, to model a TFBS, the nucleotides at each position are weighted according to the tolerance of the TF. One common way to represent this is a position weight matrix (PWM), also called a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), which is widely used to represent motifs (in our case TFBSs) in biological sequences.
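To make the idea of position-specific weights concrete, here is a minimal Python sketch that turns per-position nucleotide counts into log-odds weights and scores a sequence against them. The counts are made up for illustration, and the pseudocount of 1 and the uniform background of 0.25 per nucleotide are assumptions of mine; other tools may choose differently.

```python
import math

def pfm_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position counts into log2-odds weights.

    The pseudocount and uniform background are assumptions of this sketch.
    """
    n_sites = sum(counts[base][0] for base in "ACGT")
    length = len(counts["A"])
    pwm = {}
    for base in "ACGT":
        pwm[base] = [
            math.log2(
                ((counts[base][i] + pseudocount) / (n_sites + 4 * pseudocount))
                / background
            )
            for i in range(length)
        ]
    return pwm

def score(pwm, seq):
    """Sum the weights of the observed nucleotide at each position."""
    return sum(pwm[base][i] for i, base in enumerate(seq.upper()))

def best_hit(pwm, sequence):
    """Slide the matrix along the sequence; return (best score, start index)."""
    width = len(pwm["A"])
    windows = [
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)

# Hypothetical counts from four bound sequences (illustration only)
counts = {
    "A": [0, 0, 0, 4, 0, 3],
    "C": [1, 0, 0, 0, 4, 0],
    "G": [0, 0, 3, 0, 0, 0],
    "T": [3, 4, 1, 0, 0, 1],
}
pwm = pfm_to_pwm(counts)
print(best_hit(pwm, "ACGTTTGACAGG"))
```

A nucleotide that the TF tolerates poorly at a position gets a negative weight, so a window scores highly only when it agrees with the preferred nucleotides at most positions. This sketch assumes the input contains only A, C, G and T.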
How do we find TFBSs? DNA sequences that interact with TFs can be experimentally determined from SELEX experiments. Because this process starts from a large number of randomly generated oligonucleotides, it reveals both the DNA sequences that interact with a TF and the variation tolerated at specific positions. From SELEX experiments, a position frequency matrix (PFM) can be constructed by recording the position-dependent frequency of each nucleotide in the DNA sequences that interacted with the TF. Here's an example of a PFM as shown in this review "Applied bioinformatics for the identification of regulatory elements" (sorry paywall!):
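Independently of the review's example, the counting step itself can be sketched in Python. The bound sequences below are invented for illustration and are not taken from the review or from any SELEX data set; the function name `build_pfm` is also my own.

```python
def build_pfm(sites):
    """Count each nucleotide at each position of equal-length binding sites."""
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site.upper()):
            pfm[base][pos] += 1
    return pfm

# Hypothetical sequences that bound the TF (illustration only)
sites = ["TTGACA", "TTGACT", "TTTACA", "CTGACA"]
pfm = build_pfm(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column of the matrix sums to the number of sites, so a column dominated by one nucleotide marks a position where the TF tolerates little variation.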
Updated 2013 October 4th. Recently I've been looking into transcription factor binding site analyses. With my mind set on this, I thought I'd brush up this old post.
MEME is a tool for discovering motifs in a group of related DNA or protein sequences.
As a discovery tool, it is able to find de novo motifs. As kind of a silly test for this software, I wrote a Perl script that inserts a motif randomly in a set of random sequences.
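The script I wrote was in Perl; what follows is only a rough Python sketch of the same idea, not that script. The motif, the sequence length, the number of sequences and the seed are arbitrary choices, and this version overwrites a random window with the motif so that every sequence keeps the same length, which is an assumption of the sketch.

```python
import random

def random_dna(length, rng):
    """Return a random DNA string with uniform nucleotide frequencies."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def implant_motif(background, motif, rng):
    """Overwrite a randomly chosen window of the background with the motif."""
    start = rng.randrange(len(background) - len(motif) + 1)
    implanted = background[:start] + motif + background[start + len(motif):]
    return implanted, start

def make_test_set(n_seqs, seq_len, motif, seed=0):
    """Build (name, motif start, sequence) records with one motif copy each."""
    rng = random.Random(seed)
    records = []
    for i in range(n_seqs):
        seq, start = implant_motif(random_dna(seq_len, rng), motif, rng)
        records.append((f"seq{i}", start, seq))
    return records

# Hypothetical parameters, chosen only for illustration
records = make_test_set(5, 50, "TTGACA", seed=42)
for name, start, seq in records:
    print(f">{name} motif_at={start}")
    print(seq)
```

The output is in FASTA format, which MEME accepts as input, and the recorded start positions make it easy to check whether the motif it reports lines up with the implanted one.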