Clustering mapped reads

Updated 2014 October 8th: added an analysis using CAGE data from ENCODE and rewrote parts of the post.

In this post I will write about paraclu, a method for clustering mapped reads. It is particularly useful when working with CAGE data, where transcription initiates at a range of positions that all report on the promoter activity of a transcript, so it makes sense to cluster the transcription start sites (TSSs) together. In addition, paraclu clusters the data at different levels of granularity, so you can choose the level you want. Studying the clusters at different levels can also reveal subtle properties of promoters, much like adjusting the bin size of a histogram to see whether certain features emerge.

To get started, download and compile paraclu:

wget http://www.cbrc.jp/paraclu/paraclu-9.zip
unzip paraclu-9.zip
cd paraclu-9
make
#if we run paraclu it will tell us that
#it needs some input
paraclu
paraclu: I need a minValue and a fileName

The input to paraclu is a sorted four-column file (the program recognises spaces or tabs):

chr1    +       17689   3

where the first column is the chromosome, the second is the strand, the third is the start position, and the fourth is the number of reads starting at that position on that strand. I created this test input file:

chr1    +       10000   1
chr1    +       10010   10
chr1    +       10020   100
chr1    +       10030   1000
chr1    +       10040   10000
chr1    +       10050   1000
chr1    +       10060   100
chr1    +       10070   10
chr1    +       10080   1

which I will now test with paraclu:

#the 1 is the minValue: clusters whose total
#read count is below this value are omitted
paraclu 1 test.input
# sequence, strand, start, end, sites, sum of values, min d, max d
chr1    +       10000   10080   9       12222   -1e+100 0.1
chr1    +       10000   10000   1       1       0.1     1e+100
chr1    +       10010   10070   7       12220   0.1     1
chr1    +       10010   10010   1       10      1       1e+100
chr1    +       10020   10060   5       12200   1       10
chr1    +       10020   10020   1       100     10      1e+100
chr1    +       10030   10050   3       12000   10      100
chr1    +       10030   10030   1       1000    100     1e+100
chr1    +       10040   10040   1       10000   100     1e+100
chr1    +       10050   10050   1       1000    100     1e+100
chr1    +       10060   10060   1       100     10      1e+100
chr1    +       10070   10070   1       10      1       1e+100
chr1    +       10080   10080   1       1       0.1     1e+100

#if I increase the minValue,
#I don't get as many clusters
paraclu 1000 test.input
# sequence, strand, start, end, sites, sum of values, min d, max d
chr1    +       10000   10080   9       12222   -1e+100 0.1
chr1    +       10010   10070   7       12220   0.1     1
chr1    +       10020   10060   5       12200   1       10
chr1    +       10030   10050   3       12000   10      100
chr1    +       10030   10030   1       1000    100     1e+100
chr1    +       10040   10040   1       10000   100     1e+100
chr1    +       10050   10050   1       1000    100     1e+100

The results are presented as an eight-column file, where the columns are:

sequence - the chromosome
strand - the strandedness
start - the start of the cluster
end - the end of the cluster
sites - the number of sites clustered together
sum of values - the sum of the sites
min d - the cluster's minimum density
max d - the cluster's maximum density

First, to understand d, we need to understand what a cluster or segment is. The paper states:

Clusters are defined to be maximal scoring segments of a chromosome, where the score is given by this formula: Score = (number of events in segment) - d x (segment size in nt).

For a discussion of maximal scoring segments, check out A Linear Time Algorithm for Finding All Maximal Scoring Subsequences, which has a nice example of the maximal scoring approach: identifying transmembrane domains in proteins, which are rich in hydrophobic residues. A hydropathy index assigns each of the 20 amino acids a score ranging from least hydrophobic (-5) to most hydrophobic (+3), and by searching for the subsequences with the highest total residue scores, the transmembrane domains of a protein can be identified.
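To make the scoring concrete, here is a small Perl sketch of my own (not part of paraclu) that brute-forces the best-scoring segment of the test input at a few values of d:

#!/usr/bin/env perl
#sketch: find the maximal scoring segment of the test input
#for several values of d, using
#Score = (events in segment) - d * (segment size in nt)
#size is taken as the inclusive length; paraclu's exact
#convention may differ by one
use strict;
use warnings;

#position => read count, copied from test.input
my %site = (
   10000 => 1,    10010 => 10,    10020 => 100,
   10030 => 1000, 10040 => 10000, 10050 => 1000,
   10060 => 100,  10070 => 10,    10080 => 1,
);
my @pos = sort { $a <=> $b } keys %site;

for my $d (0.5, 5, 50, 500) {
   my ($best_score, $best_seg) = (-1e100, '');
   #brute force over all segments; fine for nine sites
   for my $i (0 .. $#pos) {
      for my $j ($i .. $#pos) {
         my $events = 0;
         $events += $site{ $pos[$_] } for $i .. $j;
         my $size  = $pos[$j] - $pos[$i] + 1;
         my $score = $events - $d * $size;
         ($best_score, $best_seg) = ($score, "$pos[$i]-$pos[$j]")
            if $score > $best_score;
      }
   }
   print "d=$d\tbest segment: $best_seg\tscore: $best_score\n";
}

At d=0.5 the best segment is 10010-10070, at d=5 it is 10020-10060, at d=50 it is 10030-10050, and at d=500 it is the single site 10040, matching the nested clusters in the paraclu output above.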

For every maximal scoring segment, the minimum and maximum values of d are reported. If the range between the minimum and maximum values is large, meaning that a particular segment is maximal scoring over a large range of values of d, it is a stable cluster. This is also stated in the paraclu manual:

Briefly, the greater the fold-change between min and max density, the more prominent the cluster, and the less likely that it is due to chance fluctuations in the data.
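As a quick check, we can compute this fold change ourselves; here is a minimal Perl sketch of mine for ranking clusters in paraclu's output by max d / min d:

#!/usr/bin/env perl
#sketch: report each cluster's max d / min d fold change;
#skips the header and clusters whose min d is the -1e+100
#sentinel, where the ratio is meaningless
use strict;
use warnings;

while (<>) {
   next if /^#/;
   my ($chr, $strand, $start, $end, $sites, $sum, $min_d, $max_d) = split;
   next if $min_d <= 0;
   printf "%s\t%s\t%s\t%s\t%.3g\n", $chr, $strand, $start, $end, $max_d / $min_d;
}

Piping paraclu output through this and sorting, e.g. with sort -k5,5gr, would list the most stable clusters first.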

Clustering BAM files

Let me demonstrate the clustering of several CAGE libraries, which have been mapped and stored as BAM files. Just as a note, coordinates in the SAM format (and in BAM files as displayed by samtools) are one-based.

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CellPapAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CellPapAlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CytosolPapAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CytosolPapAlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3NucleusPapAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3NucleusPapAlnRep2.bam

We need to prepare the paraclu input file from the BAM files. We can use the bamToBed tool from BEDTools to convert each BAM file into BED. Of note, BEDTools follows the UCSC Genome Browser's internal database convention of a 0-based start position and a 1-based end position. paraclu only requires a starting coordinate, so for reads mapping to the negative strand we have to decide how the 5' end should be represented: I use the end coordinate minus one, which keeps starting positions 0-based:

my_awk='{OFS="\t"; if($6=="+") print $1,$6,$2,1; else print $1,$6,$3-1,1}'
parallel --verbose "bamToBed -i {} | awk '$my_awk' > {.}.input" ::: *.bam
cat wg*.input > cage.input
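If you prefer Perl to awk, the same conversion can be sketched like this (my own script, making the same assumptions as the one-liner above: BED6 from bamToBed with 0-based starts, and the read's 5' end taken as the TSS):

#!/usr/bin/env perl
#sketch: convert bamToBed BED6 lines into four-column
#paraclu input, keeping the 5' end of each read:
#the 0-based start for + reads and end - 1 for - reads
use strict;
use warnings;

while (<>) {
   chomp;
   my ($chr, $start, $end, $name, $score, $strand) = split /\t/;
   my $tss = $strand eq '+' ? $start : $end - 1;
   print join("\t", $chr, $strand, $tss, 1), "\n";
}

For example: bamToBed -i file.bam | ./bed2paraclu.pl > file.input (the script name is just for illustration).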

At this stage the input file is not sorted and not summarised, i.e. each line represents a single read. We can use the groupBy tool from BEDTools to sum up reads that start at the same position and on the same strand:

cat cage.input | sort -k1,1 -k2,2 -k3,3n | groupBy -g 1,2,3 -c 4 -o sum > cage_sum.input
rm cage.input
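If groupBy is unavailable, the summing step can also be sketched in Perl (my own code; note that it holds all positions in memory, whereas the sort | groupBy pipeline streams):

#!/usr/bin/env perl
#sketch: sum read counts over identical
#(chromosome, strand, start) keys and print
#a sorted four-column paraclu input
use strict;
use warnings;

my %sum;
while (<>) {
   chomp;
   my ($chr, $strand, $start, $count) = split /\t/;
   $sum{$chr}{$strand}{$start} += $count;
}

for my $chr (sort keys %sum) {
   for my $strand (sort keys %{ $sum{$chr} }) {
      for my $start (sort { $a <=> $b } keys %{ $sum{$chr}{$strand} }) {
         print join("\t", $chr, $strand, $start, $sum{$chr}{$strand}{$start}), "\n";
      }
   }
}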

Now let's run the read clustering step:

paraclu-9/paraclu 10 cage_sum.input > cage_sum.output

#check out the results
head cage_sum.output 
# sequence, strand, start, end, sites, sum of values, min d, max d
chr1    -       10377   249240488       484668  7106296 -1e+100 0.000142
chr1    -       10377   249212275       484664  7106292 0.000142        0.000227
chr1    -       14774   249212275       484663  7106291 0.000227        0.000503
chr1    -       14774   249200349       484658  7106285 0.000503        0.0011
chr1    -       14774   249170214       484631  7106252 0.0011  0.00133
chr1    -       14774   249167959       484628  7106249 0.00133 0.00155
chr1    -       14774   249153809       484612  7106227 0.00155 0.00347
chr1    -       15927   249153809       484608  7106223 0.00347 0.00509
chr1    -       15927   247495895       482521  7097789 0.00509 0.00581

If we examine the first line, we can see a single cluster that encapsulates all the signal on chromosome one. This is not very useful for studying the internal structure of the data set. Within the paraclu zip file there is a shell script, called paraclu-cut.sh, which can help simplify the results.

head paraclu-cut.sh 
#! /bin/sh

# Copyright 2011, 2012 Martin C. Frith

# This script simplifies the output of paraclu.  It omits clusters
# that are too long, or are singletons.  Then, it removes clusters
# whose fold-increase in density is too small.  Finally, it removes
# clusters that are contained in larger clusters.

# This script assumes that the input is sorted by start position

By default, the script removes clusters that are longer than 200 nt, have max d / min d < 2, or are contained within another cluster. I made a test case for each condition below:

#removed because max d / min d < 2
chr1    -       10377   10380   484668  7106296 2.01    4
#not removed
chr1    -       20000   20080   484664  7106292 2       4
#removed because it is longer than 200
chr1    -       20000   20201   484668  7106296 2.01    4
#not removed
chr1    -       30000   30080   484664  7106292 2       4
#removed because it is contained within the cluster above
chr1    -       30020   30060   484664  7106292 2       4

Now if we run the script, only two of the results will remain:

paraclu-9/paraclu-cut.sh test.output 
chr1    -       20000   20080   484664  7106292 2       4
chr1    -       30000   30080   484664  7106292 2       4
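For reference, the three default filters could be sketched in Perl along these lines (my own approximation, not a replacement for paraclu-cut.sh; the exact length convention may differ by one from the original script):

#!/usr/bin/env perl
#sketch of paraclu-cut.sh's default filters: drop clusters
#longer than 200 nt or with a single site, drop clusters
#with max d / min d < 2, then drop clusters contained in a
#previously kept cluster; assumes input sorted by start, and
#relies on paraclu clusters on a strand being nested or
#disjoint, so the last kept cluster is the only possible container
use strict;
use warnings;

my ($max_len, $min_fold) = (200, 2);
my %last;   #last kept cluster per chromosome and strand

while (<>) {
   next if /^#/;
   chomp;
   my ($chr, $strand, $start, $end, $sites, $sum, $min_d, $max_d) = split;
   next if $end - $start > $max_len;
   next if $sites <= 1;
   next if $min_d <= 0 || $max_d / $min_d < $min_fold;
   my $prev = $last{$chr}{$strand};
   next if $prev && $start >= $prev->[0] && $end <= $prev->[1];
   $last{$chr}{$strand} = [$start, $end];
   print join("\t", $chr, $strand, $start, $end, $sites, $sum, $min_d, $max_d), "\n";
}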

We can run this script on our results:

paraclu-9/paraclu-cut.sh cage_sum.output > cage_sum_simplified.output

head cage_sum_simplified.output
chr1    -       16330   16332   2       23      0.113   0.5
chr1    -       17469   17497   7       12      0.0291  0.0952
chr1    -       19765   19809   11      15      0.0173  0.0833
chr1    -       21570   21615   11      24      0.0133  0.143
chr1    -       29262   29392   57      8171    0.0161  0.143
chr1    -       88202   88385   11      13      0.0154  0.0385
chr1    -       564518  564519  2       32      1.44    14
chr1    -       564532  564533  2       12      1.44    4
chr1    -       564644  564645  2       11      3       5
chr1    -       564686  564802  79      1686    4.86    6

#convert the paraclu output file into bed
cat cage_sum_simplified.output | awk 'OFS="\t" {print $1, $3, $4, $1"_"$3"_"$4"_"$2, $6, $2}' | sort -k1,1V -k2,2n -k6,6 > cage_sum_simplified.bed

Note that the script removes clusters contained within other clusters, which may be something you want to preserve. What we are also missing is the contribution of reads from each individual library to each cluster; I tested various approaches using BEDTools, but in the end it was easier to use some scripts that I wrote. The script below (count.pl, as run further down) tallies, for one library's .input file, the number of reads falling within each cluster of the BED file.

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.input> <infile.bed>\n";
my $infile = shift or die $usage;
my $bed = shift or die $usage;

#bin size of 200 bp: clusters longer than 200 nt were
#removed by paraclu-cut.sh, so a site's own bin plus its
#neighbours covers every cluster that can overlap it
my $bin_size = 200;

my %bed = ();
open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
   chomp;
   my ($chr, $start, $end, $id, $score, $strand) = split(/\t/);
   my $which_bin = int($start / $bin_size);
   #print "$which_bin\t$_\n";
   my $key = join(":", $chr, $start, $end, $id, $score, $strand);
   $bed{$chr}{$strand}{$which_bin}{$key} = 1;
}
close(IN);

my %result = ();
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   #chr1    +       10088   1
   my ($c, $str, $sta, $count) = split(/\t/);
   my $which_bin = int($sta / $bin_size);

   #check the bin before and after as well,
   #in case a cluster spans a bin boundary
   for ($which_bin - 1 .. $which_bin + 1){
      my $current_bin = $_;
      if (exists $bed{$c}{$str}{$current_bin}){
         foreach my $key (keys %{ $bed{$c}{$str}{$current_bin} }){
            my ($chr, $start, $end, $id, $score, $strand) = split(/:/,$key);
            if ($sta >= $start && $sta <= $end){
               $result{$key}++;
            }
         }
      }
   }
}
close(IN);

foreach my $key (keys %result){
   my ($chr, $start, $end, $id, $score, $strand) = split(/:/,$key);
   print join("\t", $chr, $start, $end, 0, $score, $strand, $result{$key}),"\n";
}

exit(0);

__END__

Now let's run the script:

parallel --verbose "count.pl {} cage_sum_simplified.bed > {.}.count" ::: wg*.input

The script below (tally.pl) joins the counts into one file:

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bed> <directory>\n";
my $bed = shift or die $usage;
my $dir = shift or die $usage;

my %count = ();
my @file = ();

opendir(DIR,$dir) || die "Could not open $dir: $!\n";
while(my $infile = readdir(DIR)){
   next unless $infile =~ /\.count$/;
   push(@file, $infile);
   #readdir returns bare file names, so prepend the directory
   open(IN,'<',"$dir/$infile") || die "Could not open $dir/$infile: $!\n";
   while(<IN>){
      chomp;
      #chr9    132655887       132655891       0  11   -       2
      my ($chr, $start, $end, $junk, $count, $strand, $tally) = split('\t');
      my $id = "${chr}_${start}_${end}_${strand}";
      $count{$id}{$infile} = $tally;
   }
   close(IN);
}
closedir(DIR);

#sort the files so the column order is reproducible
@file = sort @file;

#header
print join("\t",'chr','start','end','id','score','strand',@file),"\n";

open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
   chomp;
   my ($chr, $start, $end, $junk, $count, $strand) = split('\t');
   my $id = "${chr}_${start}_${end}_${strand}";
   my $line_to_print = "$_\t";
   my $total = 0;
   if (exists $count{$id}){
      foreach my $infile (@file){
         if (exists $count{$id}{$infile}){
            $line_to_print .= "$count{$id}{$infile}\t";
            $total += $count{$id}{$infile};
         } else {
            $line_to_print .= "0\t";
         }
      }
   } else {
      die "Error missing count for $id\n";
   }
   $line_to_print =~ s/\t$//;
   #checkpoint: per-library tallies should sum to the cluster score
   die "Count mismatch for $id\n" unless $total == $count;
   print "$line_to_print\n";
}
close(IN);

exit(0);

Joining all the counts:

tally.pl cage_sum_simplified.bed . > cage_sum_simplified.tally

#voila
head cage_sum_simplified.tally 
chr     start   end     id      score   strand  wgEncodeRikenCageHelas3CellPapAlnRep1.count     wgEncodeRikenCageHelas3CellPapAlnRep2.count     wgEncodeRikenCageHelas3CytosolPapAlnRep1.count  wgEncodeRikenCageHelas3CytosolPapAlnRep2.count  wgEncodeRikenCageHelas3NucleusPapAlnRep1.count  wgEncodeRikenCageHelas3NucleusPapAlnRep2.count
chr1    16330   16332   chr1_16330_16332_-      23      -       0       1       0       0       5       17
chr1    17469   17497   chr1_17469_17497_-      12      -       1       0       0       0       7       4
chr1    19765   19809   chr1_19765_19809_-      15      -       2       1       0       1       0       11
chr1    21570   21615   chr1_21570_21615_-      24      -       3       3       0       0       1       17
chr1    29262   29392   chr1_29262_29392_-      8171    -       723     850     414     702     2121    3361
chr1    88202   88385   chr1_88202_88385_-      13      -       2       1       0       1       5       4
chr1    534298  534305  chr1_534298_534305_+    16      +       1       2       2       1       7       3
chr1    564466  564468  chr1_564466_564468_+    152     +       77      28      27      14      4       2
chr1    564518  564519  chr1_564518_564519_-    32      -       17      6       7       2       0       0 

Finally, this script (peak.pl, as run below) finds the peak of each cluster, i.e. the position (or positions, in the case of ties) with the highest number of reads:

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <paraclu.input> <file.bed>\n";
my $infile = shift or die $usage;
my $bed = shift or die $usage;

my %store = ();

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   #chr1    -       16605   27
   my ($chr, $strand, $start, $count) = split(/\t/);
   $store{$strand}{$chr}{$start} = $count;
}
close(IN);

open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
   chomp;
   my ($chr, $start, $end, $junk, $count, $strand) = split('\t');
   my $max = 0;
   my @pos = ();
   for ($start .. $end){
      if (exists $store{$strand}{$chr}{$_}){
         my $count = $store{$strand}{$chr}{$_};
         if ($count > $max){
            $max = $count;
            @pos = ($_);
         } elsif ($count == $max){
            push(@pos, $_);
         }
      }
   }
   my $peak = join(',', @pos);
   print join("\t", $_, $max, $peak),"\n";
}
close(IN);

exit(0);

Running the code:

peak.pl cage_sum.input cage_sum_simplified.bed | head
chr1    16330   16332   chr1_16330_16332_-      23      -       22      16330
chr1    17469   17497   chr1_17469_17497_-      12      -       4       17496
chr1    19765   19809   chr1_19765_19809_-      15      -       3       19780
chr1    21570   21615   chr1_21570_21615_-      24      -       9       21577
chr1    29262   29392   chr1_29262_29392_-      8171    -       3733    29347
chr1    88202   88385   chr1_88202_88385_-      13      -       2       88384,88385
chr1    534298  534305  chr1_534298_534305_+    16      +       6       534301
chr1    564466  564468  chr1_564466_564468_+    152     +       117     564467
chr1    564518  564519  chr1_564518_564519_-    32      -       18      564519
chr1    564532  564533  chr1_564532_564533_-    12      -       8       564533

Code for this post is available at https://github.com/davetang/paraclu_prep.




This work is licensed under a Creative Commons Attribution 4.0 International License.
Comments
  1. Hi Dave,

    I wanted to convert the BAM files into CTSS.bed files, where I am interested only in the first base (like FANTOM5) instead of the whole tag length (like the current ENCODE files).

    This is how I did it:

    wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CellPapAlnRep1.bam

    bamToBed -i wgEncodeRikenCageHelas3CellPapAlnRep1.bam > wgEncodeRikenCageHelas3CellPapAlnRep1.bed

    # BED- 0-based start and 1-based end

    cat wgEncodeRikenCageHelas3CellPapAlnRep1.bed | awk 'BEGIN {OFS="\t"} { if ($6 == "+") { print $1,$2,$2+1,"HelaS3Cell",$5,$6 } else if ($6 == "-") { print $1,$3-1,$3,"HelaS3Cell",$5,$6 } }' \
    | awk 'BEGIN {OFS="\t"} { A[$1":"$2":"$3":"$4":"$5":"$6]++ } END { for(i in A) { print i"\t"A[i] }}' | sed 's/:/\t/g' | sort -k1,1 -k2,2g \
    | awk 'BEGIN {OFS="\t"} { print $1,$2,$3,$4,"Ctss"FNR,$6,$7 }' > wgEncodeRikenCageHelas3CellPapAlnRep1_Ctss.bed

    What do you think of it?
    I can see you have converted both TSSs to 0-based coordinates before inputting into paraclu.

    Thanks !

  2. Hi Dave,
    I tried to download paraclu using wget http://www.cbrc.jp/paraclu/paraclu-9.zip,
    but it generates an error:
    Connecting to www.cbrc.jp|153.126.155.15|:443... connected.
    ERROR: certificate common name "www.ncrna.org" doesn't match requested host name "www.cbrc.jp".
    To connect to www.cbrc.jp insecurely, use '--no-check-certificate'.

    When I tried:
    wget --no-check-certificate http://www.cbrc.jp/paraclu/paraclu-9.zip
    Only an index.html file is downloaded.

    Do you know how to download the paraclu now?

    Thanks a lot!

