Clustering mapped reads

Updated 2014 October 8th to include an analysis using CAGE data from ENCODE and rewrote parts of the post.

In this post I will write about a read clustering method called paraclu, which allows mapped reads to be clustered together. This is particularly useful when working with CAGE data, where transcription start sites (TSSs) are known to initiate at different positions but are all providing information on the promoter activity of a transcript, so it is useful to cluster the TSSs together. In addition, paraclu allows different levels of clustering, so you can choose the level that you want. Furthermore, studying the clusters at different levels can reveal subtle properties of promoters; this is akin to adjusting the bin size of histograms to see if certain properties arise.

Continue reading

UCSC Genome Browser custom overlap tracks

One of the features of the latest update to the UCSC Genome Browser (see http://www.ncbi.nlm.nih.gov/pubmed/20959295), are tracks which overlap or overlay each other. If you're a regular user of the site, you will have noticed the ENCODE ChIP-Seq tracks that have several layers. After doing a bit of searching, I was able to make my own overlapping tracks. Since it's a new feature, there is yet to be official documentation on it. But several users have emailed the mailing list regarding this and if you join up the bits and pieces, you can piece together the instructions for making your own custom overlap tracks.

Things you require are access to a web server, and knowledge of making bigWig files. See this post on how to make bigWig files from BAM files. If you have those, next you need to set up your own track hub; follow the official instructions on how to do this.

Once you have setup your track hub and debugged it using hubCheck tool, adapt the following attributes shown below to your own trackDb.txt file. See this thread for more information.

For my example, I made two bigWig files and I wish to overlap them together as one track. So in my trackDb.txt file, I have something like this (pay attention to the lines in bold):

track all_fun
container multiWig

shortLabel all_fun
longLabel all_fun
type bigWig 0 30000
viewLimits 0:100
visibility full
maxHeightPixels 150:30:11
aggregate transparentOverlay
showSubtrackColorOnUi on
windowingFunction mean
priority 1.4
configurable on
autoScale on

track fun
bigDataUrl fun.bw
shortLabel fun
longLabel fun
parent all_fun
type bigWig
color 180,102,255

track fun2
bigDataUrl fun2.bw
shortLabel fun2
longLabel fun2
type bigWig
parent all_fun
color 136,102,255

The key is to make a container track (which I named all_fun), and for every track that you want to overlap, specify the same container name in the parent field (I have put them in bold above).

For more information on the track attributes have a look at Jim Kent's git hub and the official documentation on wiggle tracks.

Next, the UCSC Genome Browser needs to make it possible to properly display bedGraph files when reads are mapping to the exact same position but on opposite strands. UPDATE 22nd March 2012: As suggested by Xianjun (see comments for this post), to solve the issue of displaying +/- reads that map to the same spot, create two bigWig files, one for the plus strand and the other the minus, and just overlay those into one track as shown above. Worked like a charm.

For more information see http://genome.ucsc.edu/goldenPath/help/trackDb/trackDbHub.html.

Visualising RNA-Seq like data

So you've aligned your reads from an RNA-Seq or RNA-Seq like experiment to the reference genome and have produced a BAM file. You could visualise your BAM file directly by using IGV. This is fine for looking at individual libraries, when looking at several large libraries, this may become an issue. A common strategy is to create coverage plots and here are a few ways.

I will refer to this hg19_chrom_info.txt file several times below. Use the fetchChromSizes executable available at http://hgdownload.cse.ucsc.edu/admin/exe/ to create the chrom.sizes file for the UCSC database you are working with (e.g. hg19).

1. Use the tool genomeCoverageBed from the BEDTools suite to create a coverage plot in BedGraph format.

genomeCoverageBed -bg -ibam file.bam -g hg19_chrom_info.txt

This creates a file where each line shows how many reads (column 4) are mapping to the defined region in columns 1, 2 and 3:

__snipped__
chr1 117499752 117499754 1
chr1 117499754 117499760 2
chr1 117499760 117499774 3
__snipped__

2. You can use IGVTools to create a coverage plot as a tiled data file (TDF) for use with IGV

igvtools count file.bam file.cov.tdf hg19.genome

3. If you use the UCSC Genome Browser, and have a extremely large dataset, you could convert your BedGraph file into a bigWig file. The bedGraphToBigWig executable is available here, which can directly convert your BedGraph file into a BigWig file.

bedGraphToBigWig file.bg hg19_chrom_info.txt file.bw

Some known issues

Recently I found that if you have two reads that map at the exact position but one on the positive strand and the other on the negative strand, the track is not displayed properly. For example, paste the following into the browser

track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr1 59302000 59302300 -1.0
chr1 59302300 59302600 -0.75
chr1 59302600 59302900 -0.50
chr1 59302900 59303200 -0.25
chr1 59302000 59302300 1.0
chr1 59302300 59302600 0.75
chr1 59302600 59302900 0.50
chr1 59302900 59303200 0.25

And the track will look as so:

The coverage plot is missing on the negative strand. An alternative is to create two separate tracks but sometimes I already have enough tracks loaded (which is another problem, how do you look at 20 tracks all at once?) and it is easier to look at one single track.

Using IGV and converting my BAM files into the TDF format created a coverage plot by summing the information on both strands. I found this thread asking for this feature to be implemented, but I couldn't find the feature yet.

The best solution at the moment is just to split up your tracks into two i.e. positive and negative stranded tracks.

Update: The solution for showing negative and positive coverage plots on the same track is to use overlapping tracks.

Transcription factor binding site analysis

Updated 2013 October 4th. Recently I've been looking into transcription factor binding site analyses. With my mind set on this, I thought I'll brush up this old post.

MEME is a tool for discovering motifs in a group of related DNA or protein sequences.

As a discovery tool, it is able to find de novo motifs. As kind of a silly test for this software, I wrote a Perl script that inserts a motif randomly in a set of random sequences.

Continue reading

Motifs upstream of RefSeq gene models

Here's a very primitive way of looking for motifs upstream of RefSeq gene models.

1) Download the upstream sequences (-50) of RefSeq gene models using the UCSC Table Browser tool as a bed file
2) Using the fastaFromBed tool from BEDTools, make fasta files from the bed file

fastaFromBed -s -bed hg19_refgene_upstream_50_080312.bed -fi ~/genome/human/hg19.fa -fo hg19_refgene_upstream_50_080312.fa

3) Look for motifs

Here's the main code and an example to show what it is doing

#!/usr/bin/perl

use strict;
use warnings;

my $seq = 'ACGTACGTACGTACGT';

print "$seq\n";

my $length = length($seq);
foreach my $frame (0 .. 5){
   my $frame_length = $length - $frame;
   #print "Frame length: $frame_length\n";
   my $mod = $frame_length % 6;
   #print "Mod: $mod\n";
   $frame_length -= $mod;
   #print "Frame length: $frame_length\n";
   for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){
      my $mer = substr($seq,$frame,6);
      my $x = ' ' x $frame;
      print "$x$mer\n";
   }
}
#ACGTACGTACGTACGT
#ACGTAC
#      GTACGT
# CGTACG
#       TACGTA
#  GTACGT
#        ACGTAC
#   TACGTA
#         CGTACG
#    ACGTAC
#          GTACGT
#     CGTACG

Now to calculate the motifs from the fasta file

#!/usr/bin/perl

use strict;
use warnings;

my %seq = ();
my $current_id = '';
open(IN,'<','hg19_refgene_upstream_50_080312.fa') || die "Could not open hg19_refgene_upstream_50_080312.fa: $!\n";
while(<IN>){
   chomp;
   next if /^$/;
   if (/^>(.*)$/){
      $current_id = $1;
   } else {
      my $current_line = $_;
      $current_line = uc($current_line);
      $seq{$current_id} .= $current_line;
   }
}
close(IN);

my %motif = ();

foreach my $id (keys %seq){
   my $seq = $seq{$id};

   my $length = length($seq);
   foreach my $frame (0 .. 5){
      my $frame_length = $length - $frame;
      my $mod = $frame_length % 6;
      $frame_length -= $mod;
      for (my $i = $frame; $frame<$frame_length; $frame=$frame+6){
         my $mer = substr($seq,$frame,6);
         if (exists $motif{$mer}){
            $motif{$mer}++;
         } else {
            $motif{$mer} = 1;
         }
      }
   }
}

foreach my $mer (sort {$motif{$b} <=> $motif{$a}} keys %motif){
   print "$mer\t$motif{$mer}\n";
}

exit(0);

The top ten motifs were:

GGGCGG 7514
GGCGGG 7345
CCGCCC 6706
CCCGCC 6631
GGGGCG 6227
GCGGGG 6153
CGCCCC 5766
CCCCGC 5596
CGGGGC 4822
CCTCCC 4622

Now I'm going to repeat the analysis with only the 10 nucleotides upstream of the RefSeq genes

GGCGGG 985
GGGCGG 965
CCCGCC 867
CCGCCC 838
GCGGGG 774
CGCCCC 732
CCCCGC 709
GGGGCG 705
CGGGGC 661
CCTCCC 640

Almost the same.

See related post on RefSeq promoters.

Why miRNA are 22 or 23 nucleotides long

Late last year I mapped random sized DNA sequences back to the genome. The purpose was simply to see how long sequenced reads needed to be before they could be uniquely mapped to the genome. I couldn't find the statistics on this, so I just did it myself. I didn't dwell on the results too much.

One day, I heard someone say miRNA's have definitely evolved to be 22-23 nt long. As with everything in nature, there's a reason why things are the way they are. So just now, I decided to look back at my results.

The result to look at is the number of perfect matches vs. the length of the random DNA fragment. When I started to make random sized fragments larger than 20 nt long, very few of the random fragments could map uniquely. At 21 nt long, roughly 0.1% (982 out of 1_000_000) of the random DNA fragments mapped to the genome. At 22nt long, the number of mappable random DNA fragments drops to 255 out of 1_000_000 and at 23nt, 65 out of 1_000_000. 22-23 nt long DNA sequences seem like a good compromise between gaining sequence uniqueness with the shortest possible length, which is probably why they are this length. Even if there is a point mutation in a 22 nt miRNA, it will most likely still only bind to its intended target.

Making a line graph to depict timecourse data

From this helpful thread in the bioconductor mailing list.

x <- 1:50  ## these would be your genes
 set.seed(1)
 y <- matrix(rnorm(1e4), nc=200) ## this would be your gene expr matrix
 col <- rgb(190, 190, 190, alpha=60, maxColorValue=255)
 matplot(x, y, type='l', col=col)

Just to see what it is doing, I made a simpler example

#variable with 2 rows
one <- 1:2
#matrix with 2 rows and 10 columns
two <- matrix(rnorm(20), nc=10)
two
#           [,1]        [,2]      [,3]        [,4]       [,5]      [,6]      [,7]
#[1,] -0.6822078 -0.25108283  2.425193 -0.05766436 -2.6879801 0.3658529 -1.987125
#[2,] -1.0733665 -0.01278288 -1.403296 -0.13803471 -0.5859938 0.4553595 -1.395202
#          [,8]      [,9]      [,10]
#[1,] -0.281798 1.8890190  1.0734526
#[2,] -1.498866 0.1831983 -0.4870297
#plots each column in the matrix two, along x
matplot(one,two,type='l')

Column 5 of the matrix "two" can most easily be seen as the dotted aqua line (from -2.6879801 to -0.5859938).

This plot could be useful if you wanted to depict the gene expression of 50 genes at 10 timepoints in a timecourse experiment (make a matrix of 10 rows by 50 columns).