Creating sequence logos with R

Updated 2014 September 4th to include a script that parses the TRANSFAC matrix.dat file

The WebLogo tool allows you to create sequence logos based on multiple sequence alignments. However if you want to create a vector image of a sequence logo based on position frequency matrices we need another resource.

I found the Bioconductor package seqLogo and it seems perfect for this task. For this post I will use the Transcription Factor Binding Site (TFBS) for Hunchback available from TRANSFAC and JASPAR for creating sequence logos with R.

Here's the relevant information from the TRANSFAC matrix.dat file for Hunchback:

AC  M00022
XX
ID  I$HB_01
XX
DT  03.12.1992 (created); ewi.
DT  29.03.2012 (updated); sla.
CO  Copyright (C), Biobase GmbH.
XX
NA  Hb
XX
DE  Hunchback
XX
TY  specific
XX
OS  Insecta
OC  eukaryota; animalia; metazoa; arthropoda; insecta
XX
BF  T00395; Hb; Species: fruit fly, Drosophila melanogaster; site(s) included: yes.
XX
P0      A      C      G      T
01      1      5      8      2      S
02      6      8      2      0      M
03      9      3      4      0      A
04      4      3      1      8      T
05     13      1      0      2      A
06     16      0      0      0      A
07     16      0      0      0      A
08     14      0      2      0      A
09     15      1      0      0      A
10      9      2      2      3      A
XX
BA  16 sequences (compiled matrix imported from literature reference)
XX
BS  GCCAAAAAAA; R05431; 1; 10;; p.
BS  CGCAAAAAAA; R05432; 1; 10;; p.
BS  TCATAAAAAC; R05433; 1; 10;; p.
BS  GCATTAAAAA; R05434; 1; 10;; p.
BS  TCGTAAAAAC; R05435; 1; 10;; p.
BS  GAATTAAAAA; R05436; 1; 10;; p.
BS  GGGAAAAAAA; R05437; 1; 10;; p.
BS  GCAGCAAAAA; R05438; 1; 10;; p.
BS  GAACAAAAAA; R05439; 1; 10;; p.
BS  GAACAAAGAA; R05440; 1; 10;; p.
BS  CCGTAAAGAG; R05441; 1; 10;; p.
BS  GCACAAAAAT; R05442; 1; 10;; p.
BS  AAATAAAAAA; R05443; 1; 10;; p.
BS  CAATAAAAAG; R05444; 1; 10;; p.
BS  CACTAAAAAT; R05445; 1; 10;; p.
BS  CCGAAAAACT; R05446; 1; 10;; p.
XX
CC  16 Hb binding sites within the eve promoter
XX
DR  JASPAR: MA0049.
XX
RN  [1]; RE0001854.
RX  PUBMED: 2507923.
RA  Stanojevic D., Hoey T., Levine M.
RT  Sequence-specific DNA-binding activities of the gap proteins encoded by hunchback and Krueppel in Drosophila
RL  Nature 341:331-335 (1989).
XX

Using the nucleotide frequencies from above, we can create a sequence logo using R:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("seqLogo")

library(seqLogo)
#using the frequencies above, create four vectors
a <- c(1,6,9,4,13,16,16,14,15,9)
c <- c(5,8,3,3,1,0,0,0,1,2)
g <- c(8,2,4,1,0,0,0,2,0,2)
t <- c(2,0,0,8,2,0,0,0,0,3)

#create data frame using the four vectors
df <- data.frame(a,c,g,t)
df
    a c g t
1   1 5 8 2
2   6 8 2 0
3   9 3 4 0
4   4 3 1 8
5  13 1 0 2
6  16 0 0 0
7  16 0 0 0
8  14 0 2 0
9  15 1 0 0
10  9 2 2 3

#define function that divides the frequency by the row sum i.e. proportions
proportion <- function(x){
   rs <- sum(x);
   return(x / rs);
}

#create position weight matrix
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
postscript("hunchback.eps")
seqLogo(pwm)
dev.off()

Parsing the matrix.dat file

I wrote a script that takes as input the matrix.dat file from TRANSFAC and an accession ID and generates the code above, so we don't have to type it each time we want to create sequence logos for a particular transcription factor:

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <matrix.dat> <accession>\n";
my $infile = shift or die $usage;
my $wanted = shift or die $usage;

my $accession = '';
my $start = 0;
my $entry = [];
my $switch = 0;
my @nuc = ('a', 'c', 'g', 't');

print <<EOF;
library(seqLogo)
proportion <- function(x){
   rs <- sum(x);
   return(x / rs);
}
EOF

open(IN, '<', $infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   #AC  M00001
   if (/^AC\s+(.*)$/){
      $accession = $1;
   }
   #NA  MyoD
   if (/^NA\s+(.*)$/){
      $accession .= "_$1";
      if ($accession =~ /$wanted/i){
         $switch = 1;
      }
   }
   #P0      A      C      G      T
   if (/^P0/){
      $start = 1;
      $entry->[0] = "$accession";
      next;
   #XX
   } elsif (/^XX/ && $start == 1){
      $start = 0;
      if ($switch == 1){
         #four nucleotides
         print "#$entry->[0]\n";
         for (my $i=0; $i<4; ++$i){
            my $nuc = "$nuc[$i] <- c(";
            for (my $j=1; $j<scalar(@{$entry}); ++$j){
               $nuc .= "$entry->[$j]->[$i],";
            }
            $nuc =~ s/\,$/)/;
            print "$nuc\n";
         }
print <<EOF;
df <- data.frame(a,c,g,t)
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
seqLogo(pwm)

EOF  
      }
      $entry = [];
      $switch = 0;
   }
   #P0      A      C      G      T
   #01      1      2      2      0      S
   #02      2      1      2      0      R
   #03      3      0      1      1      A
   #04      0      5      0      0      C
   #05      5      0      0      0      A
   #06      0      0      4      1      G
   #07      0      1      4      0      G
   #08      0      0      0      5      T
   #09      0      0      5      0      G
   #10      0      1      2      2      K
   #11      0      2      0      3      Y
   #12      1      0      3      1      G
   #XX
   if ($start == 1){
      my @nuc = split(/\s+/);
      #remove last column
      pop(@nuc);
      #remove first column
      shift(@nuc);
      my $line = scalar(@{$entry});
      for (my $i=0; $i< scalar(@nuc); ++$i){
         $entry->[$line]->[$i] = $nuc[$i];
      }
   }
}
close(IN);

exit(0);

Running the script:

parse_matrix_dat.pl matrix.dat M00022
#we can paste the output into R
library(seqLogo)
proportion <- function(x){
   rs <- sum(x);
   return(x / rs);
}
#M00022_Hb
a <- c(1,6,9,4,13,16,16,14,15,9)
c <- c(5,8,3,3,1,0,0,0,1,2)
g <- c(8,2,4,1,0,0,0,2,0,2)
t <- c(2,0,0,8,2,0,0,0,0,3)
df <- data.frame(a,c,g,t)
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
seqLogo(pwm)

More information

See the official documentation.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

19 comments Add yours

Pingback: Position weight matrix - Musings from a PhD candidate
pgsin says:

December 5, 2013 at 22:27

Thanks for the article! There is one question – how can I specify the background distribution of nucleotides / amino acids?

1. Davo says:
  
  December 5, 2013 at 23:26
  
  I went through the manual of seqLogo and searched on the Internet but could only find the same question asked two years ago on the Bioconductor mailing list and no solutions.
  
  1. pgsin says:
    
    December 5, 2013 at 23:29
    
    I saw this link. In any case, thanks
    
Omar Wagih says:

March 20, 2014 at 16:20

Check out RWebLogo: http://cran.r-project.org/web/packages/RWebLogo/index.html

Yoav Lubelsky says:

July 3, 2014 at 12:01

Is there a way to change the X axis labels? For example I want to show it as relative positions compared to the start site of a gene rather then just position within the motif.
Thanks.

1. Omar Wagih says:
  
  July 3, 2014 at 22:22
  
  Yep. Use the “annotate” option. Like this:
  
  aln <- c('CCAACCCAA', 'CCAACCCTA', 'AAAGCCTGA', 'TGAACCGGA')
  labels = c("a", "b", "c", "d", "e", "f", "g", "h", "i")
  weblogo(aln, annotate=labels)
  
  Cheers
  
Yoav Lubelsky says:

July 6, 2014 at 06:03

Still no luck. I’m using the seqLogo command rather then weblogo, is that the problem?
a <- c(.1,.08,.07,.08,.06,.06,.04,.08,.08,.07,.06,.19,.02,1,0,.21,.19)
c <- c(.41,.42,.41,.4,.38,.43,.42,.46,.49,.54,.45,.38,.82,0,0,.13,.21)
g <- c(.15,.14,.14,.13,.13,.12,.13,.14,.1,.08,.08,.26,0,0,1,.58,.29)
t <- c(.34,.36,.38,.39,.43,.39,.41,.32,.33,.31,.41,.17,.16,0,0,.08,.31)

df1 <- data.frame(a,c,g,t)

labels <-c("-15","-14","-13","-12","-11","-10","-9","-8","-7","-6","-5","-4","-3","-2","-1","1","2")

z1 = as.matrix(df1[,c("a","c","g","t")])
z1 = t(z1)

pwm1 <- makePWM(z1)
seqLogo(pwm1,ic.scale=FALSE, xfontsize=8, yfontsize=8, annotate=labels)

dev.off()

Thanks

1. Davo says:
  
  July 7, 2014 at 07:59
  
  Omar was referring to the R package RWebLogo. The seqLogo() function is from the seqLogo package, which is the package that this post is based on. At least from the documentation of seqLogo(), it seems there is no option to change the labels.
  
Santosh says:

July 23, 2014 at 13:24

how to work on position weight matrix for peptides/protein sequences just having the peptide sequences in hand without a PFM?

Ex : “ERLRELETKIGSLQEDL”

1. Davo says:
  
  July 23, 2014 at 13:40
  
  I saw your question on Biostars too but I don’t understand what you’re asking for. If you have one sequence, there is no frequency of amino acids at each position.
  
  1. Santosh says:
    
    July 23, 2014 at 14:48
    
    Hi Davo,
    I actually have 1597 peptides/amino acid sequences. I would like to get a Position weight matrix for it. Any more clarification needed ?
    
2. Omar Wagih says:
  
  July 23, 2014 at 14:56
  
  seqLogo won’t deal with amino acid peptides. You can’t even modify it to do so, because it draws letters as predefined glyphs, stored in functions. It’s why I wrote RWebLogo. If your peptides are all of same length and stored in a file, simply do
  
  library(RWebLogo)
  weblogo( seqs = readLines(“/path/to/peptides/file”), file.out = “mylogo.pdf” )
  
Wencai says:

October 24, 2014 at 19:59

Thanks a lot, this is very useful especially for me, I’ve never used SeqLogo before. Can I ask a maybe stupid question: how can I get the nucleotide frequencies matrix based on my sequences? Or if I can specifically say: for example I have 5 sequences with each sequence 10 bases long, then how can I get its corresponding a,t,g,c frequencies or how can i make the vector of a,t,g,c? should I calculate it manually?
Many thanks in advance.

1. Davo says:
  
  November 13, 2014 at 05:33
  
  I’m not sure if any tool exists to do that, but you can try to write your own function to calculate the frequencies.
  
Ted says:

June 1, 2015 at 12:59

Hi,
I want to show U instead of T in the plot. Is there any way of doing that?

1. Davo says:
  
  June 2, 2015 at 09:59
  
  I had a brief look at the seqLogo documentation and I don’t think it’s supported.
  
  1. Ted says:
    
    June 3, 2015 at 05:03
    
    Yeah. Is that possible that editing the source code will work? (I’m not sure how though). SeqLogo is an easy tool for plotting. Also I have heard motifStack package will do but since I cannot troubleshoot the errors I’m getting using motiStack (like, Error in getClass(Class, where = topenv(parent.frame())) :
    “pcm” is not a defined class) ), Can you suggest a similar tutorial for plotting RNA sequences?
    
Supertech says:

June 15, 2018 at 01:16

Hi, I am wondering if there is any guideline on how many fold coverage someone typically needs to determine consensus reliably? 10X, 50X, etc. of the target genome. Thanks.

Sequence logos with R

Parsing the matrix.dat file

More information

Leave a Reply Cancel reply