Sequence logos with R

Updated 2014 September 4th to include a script that parses the TRANSFAC matrix.dat file

The WebLogo tool allows you to create sequence logos based on multiple sequence alignments. However if you want to create a vector image of a sequence logo based on position frequency matrices we need another resource.

I found the Bioconductor package seqLogo and it seems perfect for this task. For this post I will use the Transcription Factor Binding Site (TFBS) for Hunchback available from TRANSFAC and JASPAR for creating sequence logos with R.

Here's the relevant information from the TRANSFAC matrix.dat file for Hunchback:

AC  M00022
XX
ID  I$HB_01
XX
DT  03.12.1992 (created); ewi.
DT  29.03.2012 (updated); sla.
CO  Copyright (C), Biobase GmbH.
XX
NA  Hb
XX
DE  Hunchback
XX
TY  specific
XX
OS  Insecta
OC  eukaryota; animalia; metazoa; arthropoda; insecta
XX
BF  T00395; Hb; Species: fruit fly, Drosophila melanogaster; site(s) included: yes.
XX
P0      A      C      G      T
01      1      5      8      2      S
02      6      8      2      0      M
03      9      3      4      0      A
04      4      3      1      8      T
05     13      1      0      2      A
06     16      0      0      0      A
07     16      0      0      0      A
08     14      0      2      0      A
09     15      1      0      0      A
10      9      2      2      3      A
XX
BA  16 sequences (compiled matrix imported from literature reference)
XX
BS  GCCAAAAAAA; R05431; 1; 10;; p.
BS  CGCAAAAAAA; R05432; 1; 10;; p.
BS  TCATAAAAAC; R05433; 1; 10;; p.
BS  GCATTAAAAA; R05434; 1; 10;; p.
BS  TCGTAAAAAC; R05435; 1; 10;; p.
BS  GAATTAAAAA; R05436; 1; 10;; p.
BS  GGGAAAAAAA; R05437; 1; 10;; p.
BS  GCAGCAAAAA; R05438; 1; 10;; p.
BS  GAACAAAAAA; R05439; 1; 10;; p.
BS  GAACAAAGAA; R05440; 1; 10;; p.
BS  CCGTAAAGAG; R05441; 1; 10;; p.
BS  GCACAAAAAT; R05442; 1; 10;; p.
BS  AAATAAAAAA; R05443; 1; 10;; p.
BS  CAATAAAAAG; R05444; 1; 10;; p.
BS  CACTAAAAAT; R05445; 1; 10;; p.
BS  CCGAAAAACT; R05446; 1; 10;; p.
XX
CC  16 Hb binding sites within the eve promoter
XX
DR  JASPAR: MA0049.
XX
RN  [1]; RE0001854.
RX  PUBMED: 2507923.
RA  Stanojevic D., Hoey T., Levine M.
RT  Sequence-specific DNA-binding activities of the gap proteins encoded by hunchback and Krueppel in Drosophila
RL  Nature 341:331-335 (1989).
XX

Using the nucleotide frequencies from above, we can create a sequence logo using R:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("seqLogo")

library(seqLogo)
#using the frequencies above, create four vectors
a <- c(1,6,9,4,13,16,16,14,15,9)
c <- c(5,8,3,3,1,0,0,0,1,2)
g <- c(8,2,4,1,0,0,0,2,0,2)
t <- c(2,0,0,8,2,0,0,0,0,3)

#create data frame using the four vectors
df <- data.frame(a,c,g,t)
df
    a c g t
1   1 5 8 2
2   6 8 2 0
3   9 3 4 0
4   4 3 1 8
5  13 1 0 2
6  16 0 0 0
7  16 0 0 0
8  14 0 2 0
9  15 1 0 0
10  9 2 2 3

#define function that divides the frequency by the row sum i.e. proportions
proportion <- function(x){
   rs <- sum(x);
   return(x / rs);
}

#create position weight matrix
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
postscript("hunchback.eps")
seqLogo(pwm)
dev.off()

hunchback-01

Parsing the matrix.dat file

I wrote a script that takes as input the matrix.dat file from TRANSFAC and an accession ID and generates the code above, so we don't have to type it each time we want to create sequence logos for a particular transcription factor:

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <matrix.dat> <accession>\n";
my $infile = shift or die $usage;
my $wanted = shift or die $usage;

my $accession = '';
my $start = 0;
my $entry = [];
my $switch = 0;
my @nuc = ('a', 'c', 'g', 't');

print <<EOF;
library(seqLogo)
proportion <- function(x){
   rs <- sum(x);
   return(x / rs);
}
EOF

open(IN, '<', $infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   #AC  M00001
   if (/^AC\s+(.*)$/){
      $accession = $1;
   }
   #NA  MyoD
   if (/^NA\s+(.*)$/){
      $accession .= "_$1";
      if ($accession =~ /$wanted/i){
         $switch = 1;
      }
   }
   #P0      A      C      G      T
   if (/^P0/){
      $start = 1;
      $entry->[0] = "$accession";
      next;
   #XX
   } elsif (/^XX/ && $start == 1){
      $start = 0;
      if ($switch == 1){
         #four nucleotides
         print "#$entry->[0]\n";
         for (my $i=0; $i<4; ++$i){
            my $nuc = "$nuc[$i] <- c(";
            for (my $j=1; $j<scalar(@{$entry}); ++$j){
               $nuc .= "$entry->[$j]->[$i],";
            }
            $nuc =~ s/\,$/)/;
            print "$nuc\n";
         }
print <<EOF;
df <- data.frame(a,c,g,t)
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
seqLogo(pwm)

EOF  
      }
      $entry = [];
      $switch = 0;
   }
   #P0      A      C      G      T
   #01      1      2      2      0      S
   #02      2      1      2      0      R
   #03      3      0      1      1      A
   #04      0      5      0      0      C
   #05      5      0      0      0      A
   #06      0      0      4      1      G
   #07      0      1      4      0      G
   #08      0      0      0      5      T
   #09      0      0      5      0      G
   #10      0      1      2      2      K
   #11      0      2      0      3      Y
   #12      1      0      3      1      G
   #XX
   if ($start == 1){
      my @nuc = split(/\s+/);
      #remove last column
      pop(@nuc);
      #remove first column
      shift(@nuc);
      my $line = scalar(@{$entry});
      for (my $i=0; $i< scalar(@nuc); ++$i){
         $entry->[$line]->[$i] = $nuc[$i];
      }
   }
}
close(IN);

exit(0);

Running the script:

parse_matrix_dat.pl matrix.dat M00022
#we can paste the output into R
library(seqLogo)
proportion <- function(x){
   rs <- sum(x);
   return(x / rs);
}
#M00022_Hb
a <- c(1,6,9,4,13,16,16,14,15,9)
c <- c(5,8,3,3,1,0,0,0,1,2)
g <- c(8,2,4,1,0,0,0,2,0,2)
t <- c(2,0,0,8,2,0,0,0,0,3)
df <- data.frame(a,c,g,t)
pwm <- apply(df, 1, proportion)
pwm <- makePWM(pwm)
seqLogo(pwm)

More information

See the official documentation.

10,000 monthly visitors, apparently

I created davetang.org on the 24th of April 2009 just for the sake of buying a domain with my name in it. Realising that I was and am paying for a service, I decided to actually make use of my web space. But it really started to become handy when I decided to pursue a PhD in April 2010; initially I just dumped everything I was learning onto my wiki and this blog. 4 years after creating davetang.org, I'm getting ~10,000 unique visitors to my domain each month according to AWStats, which my web hosting company has conveniently provided.

visitors-01The dip around April, May and June in 2012 was due to the removal of AWStats by my web host. I guess enough people complained, so they brought it back. And if this trend continues, I will need to pay for more bandwidth.

I'm guessing the increase in visitors each month is due to the growing interest in RNA sequencing. Using Google Trends (formerly known as Google insight), we observe the growing web interest into the term RNA-Seq:

Of course I am doing a much better job (in my opinion) of writing better and more useful posts. To celebrate, I changed the header image to a photo I took while playing around with long exposure times. I took the photo near the Tokyo station in Japan; it looks cool so I thought I'll make it the header. I also changed the background to black because it's easier on the eyes.

Using SQL on R data.frames

In R, I typically use data.frames to hold all my data. I've wondered when I should just use a matrix instead of a data.frame and this was nicely answered. It is also quite easy to perform operations on data.frames to obtain a subset of the data.

However if I had two data frames and wanted to connect them based on a common column, nothing seems more appropriate than a SQL join clause (especially if you've worked with relational databases before). Then I found this package in R that lets you perform SQL queries on data.frames (for example, joins!).

Installing:

install.packages("sqldf")
library("sqldf")

Testing:

library(sqldf)
sqldf("select * from iris limit 5")
Loading required package: tcltk
Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class = "tclObj") :
  [tcl] unknown math function "min".

After reading the troubleshooting page, I found out there's some problem with the tcltk library on the server. I don't have root access to the server running CentOS, so I just went with using the slower R engine.

#make sure your gsubfn version is gsubfn 0.6-4 or later
packageVersion("gsubfn")
[1] '0.6.5'
options(gsubfn.engine = "R")
sqldf("select * from iris limit 5")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

On the contrary, I installed sqldf on my Windows 7 laptop running R 2.15.2 and it worked without any errors.

Getting table information

sqldf("pragma table_info(iris)")
  cid         name type notnull dflt_value pk
1   0 Sepal_Length REAL       0       <NA>  0
2   1  Sepal_Width REAL       0       <NA>  0
3   2 Petal_Length REAL       0       <NA>  0
4   3  Petal_Width REAL       0       <NA>  0
5   4      Species TEXT       0       <NA>  0

More examples

For more examples, see the documentation.

Defining genomic regions

Updated 2014 June 24th to use GENCODE version 19

RNA sequencing (RNA-Seq) reads are typically mapped back to the genome (or transcriptome in some cases) after sequencing. The next task is to annotate the reads, to see which regions the reads mapped to. Typically one creates an annotation file and compares the coordinates of the mapped reads to the annotation file. Creating this annotation file is quite easy using BEDTools; in this post I refer to the creation of the annotation file as defining genomic regions, since in the end I will have several files that contain coordinates of exonic, intronic, and intergenic regions. I will define these regions with respect to GENCODE annotations.

Continue reading

Using the Bioconductor GenomicRanges package

Updated: 2015 July 11th as per comment

From the manual:

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project.

Installing the package:

source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")

The GRanges class represents a collection of genomic features that each have a single start and end location on the genome. This includes features such as contiguous binding sites, transcripts, and exons. These objects can be created by using the GRanges constructor function.

The manual's example for creating a GRanges object:


library("GenomicRanges")
gr <- GRanges(seqnames = Rle( c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4) ),
ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)),
strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
score = 1:10,
GC = seq(1, 0, length=10))

I would like to store my bed file as a GRanges object. Say I have a bed file as such:

chr1 66999824 67210768 NM_032291 0 +
chr1 33546713 33585995 NM_052998 0 +
chr1 48998526 50489626 NM_032785 0 -
chr1 16767166 16786584 NM_001145278 0 +
chr1 16767166 16786584 NM_001145277 0 +
chr1 8384389 8404227 NM_001080397 0 +

Using information from this question


library("GenomicRanges")
data <- read.table("test.bed",header=F)
data
    V1       V2       V3           V4 V5 V6
1 chr1 66999824 67210768    NM_032291  0  +
2 chr1 33546713 33585995    NM_052998  0  +
3 chr1 48998526 50489626    NM_032785  0  -
4 chr1 16767166 16786584 NM_001145278  0  +
5 chr1 16767166 16786584 NM_001145277  0  +
6 chr1  8384389  8404227 NM_001080397  0  +

colnames(data) <- c('chr','start','end','id','score','strand')
bed <- with(data, GRanges(chr, IRanges(start+1, end), strand, score, id=id))
bed
GRanges object with 6 ranges and 2 metadata columns:
      seqnames               ranges strand |     score           id
         <Rle>            <IRanges>  <Rle> | <integer>     <factor>
  [1]     chr1 [66999825, 67210768]      + |         0    NM_032291
  [2]     chr1 [33546714, 33585995]      + |         0    NM_052998
  [3]     chr1 [48998527, 50489626]      - |         0    NM_032785
  [4]     chr1 [16767167, 16786584]      + |         0 NM_001145278
  [5]     chr1 [16767167, 16786584]      + |         0 NM_001145277
  [6]     chr1 [ 8384390,  8404227]      + |         0 NM_001080397
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

elementMetadata(bed)
DataFrame with 6 rows and 2 columns
      score           id
  <integer>     <factor>
1         0    NM_032291
2         0    NM_052998
3         0    NM_032785
4         0 NM_001145278
5         0 NM_001145277
6         0 NM_001080397

Let make another bed file to test out the intersecting function.

chr1 66999822 66999823 1_no 0 +
chr1 33585996 33585997 2_no 0 +
chr1 16786583 16786584 3_yes 0 +
chr1 8384389 8384390 4_yes 0 +


data2 <- read.table("test2.bed",header=F)
colnames(data2) <- c('chr','start','end','id','score','strand')
bed2 <- with(data2, GRanges(chr, IRanges(start+1, end), strand, score, id=id))
bed2
GRanges object with 4 ranges and 2 metadata columns:
      seqnames               ranges strand |     score       id
         <Rle>            <IRanges>  <Rle> | <integer> <factor>
  [1]     chr1 [66999823, 66999823]      + |         0     1_no
  [2]     chr1 [33585997, 33585997]      + |         0     2_no
  [3]     chr1 [16786584, 16786584]      + |         0    3_yes
  [4]     chr1 [ 8384390,  8384390]      + |         0    4_yes
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

#intersect
intersect(bed, bed2)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]     chr1 [ 8384390,  8384390]      +
  [2]     chr1 [16786584, 16786584]      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

However I lose the id information from both bed files. I searched the web but couldn't find a way to include the metadata after the intersect and it seems the metadata just gets stripped.

Thank to a suggestion by Daniel (see comments), there is a way to retain the metadata after the overlap:

subsetByOverlaps(bed, bed2)
GRanges object with 3 ranges and 2 metadata columns:
      seqnames               ranges strand |     score           id
         <Rle>            <IRanges>  <Rle> | <integer>     <factor>
  [1]     chr1 [16767167, 16786584]      + |         0 NM_001145278
  [2]     chr1 [16767167, 16786584]      + |         0 NM_001145277
  [3]     chr1 [ 8384390,  8404227]      + |         0 NM_001080397
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

The 3 RefSeqs that overlapped with features in bed2 are returned.

Matching the overlaps

We can use the GenomicRanges functions called findOverlaps(), queryHits() and subjectHits() functions to create a data frame with the names of the elements that overlap each other:


data <- data.frame(chr = rep('chr1',6),
                   start = c(66999824,33546713,48998526,16767166,16767166,8384389),
                   end = c(67210768,33585995,50489626,16786584,16786584,8404227),
                   id = c('NM_032291','NM_052998','NM_032785','NM_001145278','NM_001145277','NM_001080397'),
                   score = rep(0,6),
                   strand = c('+','+','-','+','+','+'),
                   stringsAsFactors=F)

data
   chr    start      end           id score strand
1 chr1 66999824 67210768    NM_032291     0      +
2 chr1 33546713 33585995    NM_052998     0      +
3 chr1 48998526 50489626    NM_032785     0      -
4 chr1 16767166 16786584 NM_001145278     0      +
5 chr1 16767166 16786584 NM_001145277     0      +
6 chr1  8384389  8404227 NM_001080397     0      +

#3 overlapping snps
my_snp <- data.frame(chr = rep('chr1',3),
                   start = c(66999900,33546793,48998626),
                   end = c(66999901,33546794,48998627),
                   id = c('snp1','snp2','snp3'),
                   score = rep(0,3),
                   strand = c('+','+','-'),
                   stringsAsFactors=F)

my_snp
   chr    start      end   id score strand
1 chr1 66999900 66999901 snp1     0      +
2 chr1 33546793 33546794 snp2     0      +
3 chr1 48998626 48998627 snp3     0      -

#create the GRanges objects
library(GenomicRanges)
data_grange <- with(data,
                    GRanges(chr, IRanges(start,
                                         end,
                                         names=id
                                         ),
                            strand
                            )
                    )

snp_grange <- with(my_snp,
                    GRanges(chr, IRanges(start,
                                         end,
                                         names=id
                                         ),
                            strand
                            )
                    )

#count the overlaps
countOverlaps(data_grange, snp_grange)
   NM_032291    NM_052998    NM_032785 NM_001145278 NM_001145277 NM_001080397 
           1            1            1            0            0            0

#find the overlaps
overlaps <- findOverlaps(data_grange, snp_grange)
overlaps
Hits of length 3
queryLength: 6
subjectLength: 3
  queryHits subjectHits 
   <integer>   <integer> 
 1         1           1 
 2         2           2 
 3         3           3

#create the match
match_hit <- data.frame(names(data_grange)[queryHits(overlaps)],
                        names(snp_grange)[subjectHits(overlaps)],
                        stringsAsFactors=F
                        )

names(match_hit) <- c('query','subject')

#voila
match_hit
      query subject
1 NM_032291    snp1
2 NM_052998    snp2
3 NM_032785    snp3

See also

A tutorial on GenomicRanges.

subsetByOverlaps to keep info from both GRanges objects

My older post on intersectBed.