The process of transcription, is influenced by the interaction of proteins called transcription factors (TFs) that bind to specific sites called Transcription Factor Binding Sites (TFBSs), which are proximal or distal to a transcription starting site. TFs generally have distinct binding preferences towards specific TFBSs, however TFs can tolerate variations in the target TFBS. Thus to model a TFBS, the nucleotides are weighted accordingly, to the tolerance of the TF. One common way to represent this is by using a position weight matrix (PWM), also called position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), which is a commonly used representation of motifs (in our case TFBS) in biological sequences.
How do we find TFBSs? DNA sequences that interact with TFs can be experimentally determined from SELEX experiments. Since this process involves synthesis of a large number of randomly generated oligonucleotides, DNA sequences that interact with TFs can be determined, as well as the tolerance at specific sites. From SELEX experiments, a position frequency matrix (PFM) can be constructed by recording the position-dependent frequency of each nucleotide in the DNA sequence that interacted with the TF. Here's an example of a PFM as shown in this review "Applied bioinformatics for the identification of regulatory elements" (sorry paywall!):
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
A | 0 | 4 | 4 | 0 | 3 | 7 | 4 | 3 | 5 | 4 | 2 | 0 | 0 | 4 |
C | 3 | 0 | 4 | 8 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 2 | 4 |
G | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 6 | 8 | 5 | 0 |
T | 3 | 1 | 0 | 0 | 5 | 1 | 4 | 2 | 2 | 4 | 0 | 0 | 1 | 0 |
The construction of this PFM was done by collecting experimentally validated binding sites from 8 published studies for MEF2.
To convert a PFM to the corresponding PWM, the frequencies are converted to normalised frequency values on a log-scale. To perform this conversion we can use these formulae from the review paper:
where = PWM value of base b in position i,
= background probability of base b and
:
where ;
= counts of base b in position i;
= number of sites;
= corrected probability of base b in position i and
= pseudocount function.
The pseudocount is a sample correction that is added when assessing the probability to correct for small samples sizes and this calculation varies widely between applications. One approach is to take the square root of the number of sites that contribute to the model, which was apparently the approach used in the paper. However, when I used this as the pseudocount function, I could not replicate the numbers in the table of the review (Box 1, d). In fact, if you examine the PWM in the paper (Box 1, d), there's several typos. For example, in position 1, the frequency of C's and T's are identical, however the PWM values are different. Also the PWM for a frequency of 1 is different in position 13, T and all other positions with a frequency of 1. Most other values seem to be consistent though. So in order to find out the pseudocount function, I substituted some of the (consistent) PWM values back into the equation.
Firstly recall the conversion between the logarithmic form to the exponential form:
Therefore:
and substituting in:
The PWM value for a frequency of 0 seems to be consistent, so let's take and substitute it into the equation:
The PWM value for a frequency of 3 seems to be 0.45, so let's take and substitute it into the equation:
We now have two simultaneous equations that we can divide together, i.e. dividing by
(the denominators cancel each other out), to work out the pseudocount:
The background probabilities cancel each other out and since the pseudocounts should be the same for the different nucleotides, we can refer to them as an s:
Invert the two sides:
Solving s, we get:
Perhaps I missed it, but it wasn't pointed out exactly how or the background probability of base b is defined. Since we worked out s, we can substitute it into the equation for
and work out
:
So the background probability of base b is simply 0.25, i.e. base b divided by the total number of bases. But where did this 0.71 pseudocount come from? If we take the square root of the number of sites that contribute to the model, the square root of 8 is 2.828427. Perhaps the pseudocount needs to be scaled by base b, since the square root of 8 multiply by 1/4 is 0.7071068.
Since some of the values in the PWM of the review paper is incorrect, let's calculate it again using the formulae above (since we now know the pseudocount function and the background probability) using R:
#function for working out the position weight matrix value pwm <- function(freq, total, bg=0.25){ #using the formulae above p <- (freq + (sqrt(total) * 1/4)) / (total + (4 * (sqrt(total) * 1/4))) log2(p/bg) } #work out all possible PWM values for (i in 0:8){ print(pwm(i,8)) } [1] -1.936752 [1] -0.6651985 [1] 0 [1] 0.4535419 [1] 0.7980888 [1] 1.076008 [1] 1.308939 [1] 1.509438 [1] 1.685442
Now let's calculate the PWM:
#define the frequencies of nucleotides a <- c(0, 4, 4, 0, 3, 7, 4, 3, 5, 4, 2, 0, 0, 4) c <- c(3, 0, 4, 8, 0, 0, 0, 3, 0, 0, 0, 0, 2, 4) g <- c(2, 3, 0, 0, 0, 0, 0, 0, 1, 0, 6, 8, 5, 0) t <- c(3, 1, 0, 0, 5, 1, 4, 2, 2, 4, 0, 0, 1, 0) m <- matrix(data=c(a,c,g,t),nrow=4,byrow=T,dimnames=list(c('a','c','g','t'))) m [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] a 0 4 4 0 3 7 4 3 5 4 2 0 0 4 c 3 0 4 8 0 0 0 3 0 0 0 0 2 4 g 2 3 0 0 0 0 0 0 1 0 6 8 5 0 t 3 1 0 0 5 1 4 2 2 4 0 0 1 0 mm <- pwm(m,8) mm [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] a -1.9367518 0.7980888 0.7980888 -1.936752 0.4535419 1.5094376 0.7980888 0.4535419 1.0760078 c 0.4535419 -1.9367518 0.7980888 1.685442 -1.9367518 -1.9367518 -1.9367518 0.4535419 -1.9367518 g 0.0000000 0.4535419 -1.9367518 -1.936752 -1.9367518 -1.9367518 -1.9367518 -1.9367518 -0.6651985 t 0.4535419 -0.6651985 -1.9367518 -1.936752 1.0760078 -0.6651985 0.7980888 0.0000000 0.0000000 [,10] [,11] [,12] [,13] [,14] a 0.7980888 0.000000 -1.936752 -1.9367518 0.7980888 c -1.9367518 -1.936752 -1.936752 0.0000000 0.7980888 g -1.9367518 1.308939 1.685442 1.0760078 -1.9367518 t 0.7980888 -1.936752 -1.936752 -0.6651985 -1.9367518
Now that we have the PWM, we can generate a quantitative score for any DNA sequence (of the same length) by summing the values that correspond to the observed nucleotides at each position. Using R again:
#same sequence shown in the paper seq <- 'ttacataagtagtc' x <- strsplit(x=seq,split='') x [[1]] [1] "t" "t" "a" "c" "a" "t" "a" "a" "g" "t" "a" "g" "t" "c" #initialise vector seq_score <- vector() #get the corresponding values for (i in 1:14){ seq_score[i] <- mm[x[[1]][i],i] } seq_score [1] 0.4535419 -0.6651985 0.7980888 1.6854416 0.4535419 -0.6651985 0.7980888 0.4535419 -0.6651985 0.7980888 [11] 0.0000000 1.6854416 -0.6651985 0.7980888 #slightly different score to the paper due to their rounding sum(seq_score) [1] 5.26307 #max score sum(apply(mm,2,max)) [1] 14.31481
Lastly, I've shown previously how to generate a sequence logo from a PFM, and following the same instructions:
#install if necessary source("http://bioconductor.org/biocLite.R") biocLite("seqLogo") library(seqLogo) a <- c(0, 4, 4, 0, 3, 7, 4, 3, 5, 4, 2, 0, 0, 4) c <- c(3, 0, 4, 8, 0, 0, 0, 3, 0, 0, 0, 0, 2, 4) g <- c(2, 3, 0, 0, 0, 0, 0, 0, 1, 0, 6, 8, 5, 0) t <- c(3, 1, 0, 0, 5, 1, 4, 2, 2, 4, 0, 0, 1, 0) df <- data.frame(a,c,g,t) df a c g t 1 0 3 2 3 2 4 0 3 1 3 4 4 0 0 4 0 8 0 0 5 3 0 0 5 6 7 0 0 1 7 4 0 0 4 8 3 3 0 2 9 5 0 1 2 10 4 0 0 4 11 2 0 6 0 12 0 0 8 0 13 0 2 5 1 14 4 4 0 0 #define function that divides the frequency by the row sum i.e. proportions proportion <- function(x){ rs <- sum(x); return(x / rs); } #create position weight matrix mef2 <- apply(df, 1, proportion) mef2 <- makePWM(mef2) seqLogo(mef2)
Conclusions
With respect to transcription factors (TFs), a position weight matrix (PWM) can be generated from a position frequency matrix (PFM), which is a collection of experimentally validated binding sites. Using this PWM, any given sequence can be quantitatively scored against the motif model. The PWM models appropriately the tolerance of TFs to binding sites and one can use sequence logos to visualise PFMs.
I'd like to thank my colleague (if he ever comes across this page) for his help.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Would be helpful to also have some explanation about information content in the PWM , as that’s what the logo is actually showing
Agreed, it would be useful to have some more detailed info about what this graphic is actually telling us, from a TF-binding perspective.
Dear Dave,
thank you for your excellent blogs!
Just a small typo: The equation after “There:” (beginning with W_{b, i}) the middle equation sign should be a doublearrow like in the equation before.
Thank you! I’ve made the amendment.
Dear Dave,
Thanks for your lecture here. I have a question. Please, how do you convert PWM to count matrix or PFM?
I have a PWM (with negative numbers) for different core promoters and I want to use them to scan for motif in the promoter sequence. The online tool I am using doesn’t allow negative so I couldn’t scan my promoters. I will really appreciate your thoughts. Thanks.
Hi Sunday,
do you have any information on how the PWM was calculated? If you have that information you should be able to revert the weight matrix back into the frequency matrix.
Alternatively, you could use your PWM and the seq_score approach in the blog post to score your promoters.
I also have an updated version of this blog post at https://davetang.github.io/muse/pwm.html.
Hope that helps,
Dave