Sequence logos with R

Updated 2014 September 4th to include a script that parses the TRANSFAC matrix.dat file

The WebLogo tool allows you to create sequence logos based on multiple sequence alignments. However if you want to create a vector image of a sequence logo based on position frequency matrices we need another resource.

I found the Bioconductor package seqLogo and it seems perfect for this task. For this post I will use the Transcription Factor Binding Site (TFBS) for Hunchback available from TRANSFAC and JASPAR for creating sequence logos with R.

Here's the relevant information from the TRANSFAC matrix.dat file for Hunchback:

AC  M00022
XX
ID  I$HB_01
XX
DT  03.12.1992 (created); ewi.
DT  29.03.2012 (updated); sla.
CO  Copyright (C), Biobase GmbH.
XX
NA  Hb
XX
DE  Hunchback
XX
TY  specific
XX
OS  Insecta
OC  eukaryota; animalia; metazoa; arthropoda; insecta
XX
BF  T00395; Hb; Species: fruit fly, Drosophila melanogaster; site(s) included: yes.
XX
P0      A      C      G      T
01      1      5      8      2      S
02      6      8      2      0      M
03      9      3      4      0      A
04      4      3      1      8      T
05     13      1      0      2      A
06     16      0      0      0      A
07     16      0      0      0      A
08     14      0      2      0      A
09     15      1      0      0      A
10      9      2      2      3      A
XX
BA  16 sequences (compiled matrix imported from literature reference)
XX
BS  GCCAAAAAAA; R05431; 1; 10;; p.
BS  CGCAAAAAAA; R05432; 1; 10;; p.
BS  TCATAAAAAC; R05433; 1; 10;; p.
BS  GCATTAAAAA; R05434; 1; 10;; p.
BS  TCGTAAAAAC; R05435; 1; 10;; p.
BS  GAATTAAAAA; R05436; 1; 10;; p.
BS  GGGAAAAAAA; R05437; 1; 10;; p.
BS  GCAGCAAAAA; R05438; 1; 10;; p.
BS  GAACAAAAAA; R05439; 1; 10;; p.
BS  GAACAAAGAA; R05440; 1; 10;; p.
BS  CCGTAAAGAG; R05441; 1; 10;; p.
BS  GCACAAAAAT; R05442; 1; 10;; p.
BS  AAATAAAAAA; R05443; 1; 10;; p.
BS  CAATAAAAAG; R05444; 1; 10;; p.
BS  CACTAAAAAT; R05445; 1; 10;; p.
BS  CCGAAAAACT; R05446; 1; 10;; p.
XX
CC  16 Hb binding sites within the eve promoter
XX
DR  JASPAR: MA0049.
XX
RN  [1]; RE0001854.
RX  PUBMED: 2507923.
RA  Stanojevic D., Hoey T., Levine M.
RT  Sequence-specific DNA-binding activities of the gap proteins encoded by hunchback and Krueppel in Drosophila
RL  Nature 341:331-335 (1989).
XX

Using the nucleotide frequencies from above, we can create a sequence logo using R:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("seqLogo")

library(seqLogo)
#using the frequencies above, create four vectors
a <- c(1,6,9,4,13,16,16,14,15,9)
c <- c(5,8,3,3,1,0,0,0,1,2)
g <- c(8,2,4,1,0,0,0,2,0,2)
t <- c(2,0,0,8,2,0,0,0,0,3)

#create data frame using the four vectors
df <- data.frame(a,c,g,t)
df
    a c g t
1   1 5 8 2
2   6 8 2 0
3   9 3 4 0
4   4 3 1 8
5  13 1 0 2
6  16 0 0 0
7  16 0 0 0
8  14 0 2 0
9  15 1 0 0
10  9 2 2 3

#define function that divides the frequency by the row sum i.e. proportions
proportion <- function(x){
   rs <- sum(x);
   return(x / rs);
}

#create position weight matrix
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
postscript("hunchback.eps")
seqLogo(pwm)
dev.off()

hunchback-01

Parsing the matrix.dat file

I wrote a script that takes as input the matrix.dat file from TRANSFAC and an accession ID and generates the code above, so we don't have to type it each time we want to create sequence logos for a particular transcription factor:

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <matrix.dat> <accession>\n";
my $infile = shift or die $usage;
my $wanted = shift or die $usage;

my $accession = '';
my $start = 0;
my $entry = [];
my $switch = 0;
my @nuc = ('a', 'c', 'g', 't');

print <<EOF;
library(seqLogo)
proportion <- function(x){
   rs <- sum(x);
   return(x / rs);
}
EOF

open(IN, '<', $infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   #AC  M00001
   if (/^AC\s+(.*)$/){
      $accession = $1;
   }
   #NA  MyoD
   if (/^NA\s+(.*)$/){
      $accession .= "_$1";
      if ($accession =~ /$wanted/i){
         $switch = 1;
      }
   }
   #P0      A      C      G      T
   if (/^P0/){
      $start = 1;
      $entry->[0] = "$accession";
      next;
   #XX
   } elsif (/^XX/ && $start == 1){
      $start = 0;
      if ($switch == 1){
         #four nucleotides
         print "#$entry->[0]\n";
         for (my $i=0; $i<4; ++$i){
            my $nuc = "$nuc[$i] <- c(";
            for (my $j=1; $j<scalar(@{$entry}); ++$j){
               $nuc .= "$entry->[$j]->[$i],";
            }
            $nuc =~ s/\,$/)/;
            print "$nuc\n";
         }
print <<EOF;
df <- data.frame(a,c,g,t)
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
seqLogo(pwm)

EOF  
      }
      $entry = [];
      $switch = 0;
   }
   #P0      A      C      G      T
   #01      1      2      2      0      S
   #02      2      1      2      0      R
   #03      3      0      1      1      A
   #04      0      5      0      0      C
   #05      5      0      0      0      A
   #06      0      0      4      1      G
   #07      0      1      4      0      G
   #08      0      0      0      5      T
   #09      0      0      5      0      G
   #10      0      1      2      2      K
   #11      0      2      0      3      Y
   #12      1      0      3      1      G
   #XX
   if ($start == 1){
      my @nuc = split(/\s+/);
      #remove last column
      pop(@nuc);
      #remove first column
      shift(@nuc);
      my $line = scalar(@{$entry});
      for (my $i=0; $i< scalar(@nuc); ++$i){
         $entry->[$line]->[$i] = $nuc[$i];
      }
   }
}
close(IN);

exit(0);

Running the script:

parse_matrix_dat.pl matrix.dat M00022
#we can paste the output into R
library(seqLogo)
proportion <- function(x){
   rs <- sum(x);
   return(x / rs);
}
#M00022_Hb
a <- c(1,6,9,4,13,16,16,14,15,9)
c <- c(5,8,3,3,1,0,0,0,1,2)
g <- c(8,2,4,1,0,0,0,2,0,2)
t <- c(2,0,0,8,2,0,0,0,0,3)
df <- data.frame(a,c,g,t)
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
seqLogo(pwm)

More information

See the official documentation.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
19 comments Add yours
  1. Is there a way to change the X axis labels? For example I want to show it as relative positions compared to the start site of a gene rather then just position within the motif.
    Thanks.

    1. Yep. Use the “annotate” option. Like this:

      aln <- c('CCAACCCAA', 'CCAACCCTA', 'AAAGCCTGA', 'TGAACCGGA')
      labels = c("a", "b", "c", "d", "e", "f", "g", "h", "i")
      weblogo(aln, annotate=labels)

      Cheers

  2. Still no luck. I’m using the seqLogo command rather then weblogo, is that the problem?
    a <- c(.1,.08,.07,.08,.06,.06,.04,.08,.08,.07,.06,.19,.02,1,0,.21,.19)
    c <- c(.41,.42,.41,.4,.38,.43,.42,.46,.49,.54,.45,.38,.82,0,0,.13,.21)
    g <- c(.15,.14,.14,.13,.13,.12,.13,.14,.1,.08,.08,.26,0,0,1,.58,.29)
    t <- c(.34,.36,.38,.39,.43,.39,.41,.32,.33,.31,.41,.17,.16,0,0,.08,.31)

    df1 <- data.frame(a,c,g,t)

    labels <-c("-15","-14","-13","-12","-11","-10","-9","-8","-7","-6","-5","-4","-3","-2","-1","1","2")

    z1 = as.matrix(df1[,c("a","c","g","t")])
    z1 = t(z1)

    pwm1 <- makePWM(z1)
    seqLogo(pwm1,ic.scale=FALSE, xfontsize=8, yfontsize=8, annotate=labels)

    dev.off()

    Thanks

    1. Omar was referring to the R package RWebLogo. The seqLogo() function is from the seqLogo package, which is the package that this post is based on. At least from the documentation of seqLogo(), it seems there is no option to change the labels.

  3. how to work on position weight matrix for peptides/protein sequences just having the peptide sequences in hand without a PFM?

    Ex : “ERLRELETKIGSLQEDL”

    1. I saw your question on Biostars too but I don’t understand what you’re asking for. If you have one sequence, there is no frequency of amino acids at each position.

      1. Hi Davo,
        I actually have 1597 peptides/amino acid sequences. I would like to get a Position weight matrix for it. Any more clarification needed ?

    2. seqLogo won’t deal with amino acid peptides. You can’t even modify it to do so, because it draws letters as predefined glyphs, stored in functions. It’s why I wrote RWebLogo. If your peptides are all of same length and stored in a file, simply do

      library(RWebLogo)
      weblogo( seqs = readLines(“/path/to/peptides/file”), file.out = “mylogo.pdf” )

  4. Thanks a lot, this is very useful especially for me, I’ve never used SeqLogo before. Can I ask a maybe stupid question: how can I get the nucleotide frequencies matrix based on my sequences? Or if I can specifically say: for example I have 5 sequences with each sequence 10 bases long, then how can I get its corresponding a,t,g,c frequencies or how can i make the vector of a,t,g,c? should I calculate it manually?
    Many thanks in advance.

    1. I’m not sure if any tool exists to do that, but you can try to write your own function to calculate the frequencies.

      1. Yeah. Is that possible that editing the source code will work? (I’m not sure how though). SeqLogo is an easy tool for plotting. Also I have heard motifStack package will do but since I cannot troubleshoot the errors I’m getting using motiStack (like, Error in getClass(Class, where = topenv(parent.frame())) :
        “pcm” is not a defined class) ), Can you suggest a similar tutorial for plotting RNA sequences?

  5. Hi, I am wondering if there is any guideline on how many fold coverage someone typically needs to determine consensus reliably? 10X, 50X, etc. of the target genome. Thanks.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.