Updated 8 October 2014 to include an analysis using CAGE data from ENCODE and to rewrite parts of the post.
In this post I will write about paraclu, a method for clustering mapped reads. This is particularly useful when working with CAGE data: transcription start sites (TSSs) are known to initiate at different positions, but they all provide information on the promoter activity of a transcript, so it is useful to cluster the TSSs together. In addition, paraclu supports different levels of clustering, so you can choose the level that you want. Furthermore, studying the clusters at different levels can reveal subtle properties of promoters; this is akin to adjusting the bin size of a histogram to see whether certain features emerge.
To get started, download and compile paraclu:
wget https://davetang.org/file/paraclu-9.zip
unzip paraclu-9.zip
cd paraclu-9
make

#if we run paraclu it will tell us that
#it needs some input
paraclu
paraclu: I need a minValue and a fileName
The input to paraclu is a sorted four-column file (the program recognises spaces or tabs):
chr1 + 17689 3
where the first column is the chromosome, the second is the strand, the third is the start position, and the fourth is the number of reads starting at that position on that strand. I created this test input file:
chr1 + 10000 1
chr1 + 10010 10
chr1 + 10020 100
chr1 + 10030 1000
chr1 + 10040 10000
chr1 + 10050 1000
chr1 + 10060 100
chr1 + 10070 10
chr1 + 10080 1
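As a side note, rather than typing it out, the symmetric test file above can be generated with a short awk sketch of my own (not part of the paraclu package); the counts climb by powers of ten to a peak at position 10040 and then fall away again:

```shell
# write the nine-line test input: positions 10 bp apart,
# counts 1, 10, 100, 1000, 10000, 1000, 100, 10, 1
awk 'BEGIN {
  OFS = "\t"
  split("1 10 100 1000 10000 1000 100 10 1", count, " ")
  for (i = 1; i <= 9; i++)
    print "chr1", "+", 10000 + (i - 1) * 10, count[i]
}' > test.input
```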
which I will now test with paraclu:
#the 1 is the minValue, which is
#the minimal number of reads for forming a cluster
paraclu 1 test.input
# sequence, strand, start, end, sites, sum of values, min d, max d
chr1 + 10000 10080 9 12222 -1e+100 0.1
chr1 + 10000 10000 1 1 0.1 1e+100
chr1 + 10010 10070 7 12220 0.1 1
chr1 + 10010 10010 1 10 1 1e+100
chr1 + 10020 10060 5 12200 1 10
chr1 + 10020 10020 1 100 10 1e+100
chr1 + 10030 10050 3 12000 10 100
chr1 + 10030 10030 1 1000 100 1e+100
chr1 + 10040 10040 1 10000 100 1e+100
chr1 + 10050 10050 1 1000 100 1e+100
chr1 + 10060 10060 1 100 10 1e+100
chr1 + 10070 10070 1 10 1 1e+100
chr1 + 10080 10080 1 1 0.1 1e+100

#if I increase the minValue,
#I don't get as many clusters
paraclu 1000 test.input
# sequence, strand, start, end, sites, sum of values, min d, max d
chr1 + 10000 10080 9 12222 -1e+100 0.1
chr1 + 10010 10070 7 12220 0.1 1
chr1 + 10020 10060 5 12200 1 10
chr1 + 10030 10050 3 12000 10 100
chr1 + 10030 10030 1 1000 100 1e+100
chr1 + 10040 10040 1 10000 100 1e+100
chr1 + 10050 10050 1 1000 100 1e+100
The results are presented as an eight-column file, where the columns are:
sequence - the chromosome
strand - the strandedness
start - the start of the cluster
end - the end of the cluster
sites - the number of sites clustered together
sum of values - the sum of the read counts across the clustered sites
min d - the cluster's minimum density
max d - the cluster's maximum density
First, to understand d, we need to understand what a cluster or segment is. The paper states:
Clusters are defined to be maximal scoring segments of a chromosome, where the score is given by this formula: Score = (number of events in segment) - d x (segment size in nt).
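As a worked example (my own arithmetic, using the test data above): the three-site segment from 10030 to 10050 has 12,000 events; taking the segment size as end minus start (20 nt), its score at two values of d can be computed directly:

```shell
# score = (number of events in segment) - d * (segment size in nt)
# events at 10030, 10040 and 10050 sum to 12000; size = 10050 - 10030 = 20
awk 'BEGIN {
  sum  = 1000 + 10000 + 1000
  size = 10050 - 10030
  for (d = 10; d <= 100; d *= 10)
    printf "d=%d\tscore=%d\n", d, sum - d * size
}'
# d=10	score=11800
# d=100	score=10000
```

Note how this segment scores well over the whole range of d from 10 to 100, which is exactly the min d and max d that paraclu reports for it.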
For a discussion on maximal scoring segments, check out A Linear Time Algorithm for Finding All Maximal Scoring Subsequences, which has a nice example of using the maximal scoring approach. Their example was an application of this scoring approach to identifying transmembrane domains in proteins, which are rich in hydrophobic residues. There is a hydropathy index, which assigned scores ranging from least hydrophobic (-5) to most hydrophobic (+3) to the 20 amino acids. By examining subsequences within the protein for the highest total residue scores, they could identify the transmembrane domains of the protein.
For every maximal scoring segment, the minimum and maximum values of d are reported. If the range between the minimum and maximum values is large, meaning that a particular segment is maximal scoring over a large range of values for d, it is a stable cluster. This is also stated in the paraclu manual:
Briefly, the greater the fold-change between min and max density, the more prominent the cluster, and the less likely that it is due to chance fluctuations in the data.
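To see this numerically, clusters can be ranked by the fold-change between max d and min d (columns 8 and 7 of the output). A rough sketch of my own; the sample lines below imitate paraclu's eight-column output:

```shell
# rank clusters by stability: fold-change = max d / min d
cat << 'EOF' > clusters.txt
chr1 + 10010 10070 7 12220 0.1 1
chr1 + 10020 10060 5 12200 1 10
chr1 + 10030 10050 3 12000 10 100
chr1 - 14774 249212275 484663 7106291 0.000227 0.000503
EOF

# append the fold-change as a ninth column and sort by it, descending
awk 'OFS="\t" { print $0, $8 / $7 }' clusters.txt | sort -k9,9gr
```

The last sample line, a chromosome-spanning segment, has a fold-change of only about 2.2 and sorts below the tight, hundred-fold-stable clusters from the test data.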
Clustering BAM files
Let me demonstrate the clustering of several CAGE libraries, which have been mapped and stored as BAM files. As a note, SAM records report one-based start positions (BAM stores coordinates zero-based internally).
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CellPapAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CellPapAlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CytosolPapAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CytosolPapAlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3NucleusPapAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3NucleusPapAlnRep2.bam
We need to prepare the paraclu input file from the BAM files. We can use the bamToBed tool from BEDTools to convert each BAM file into BED. Of note, BEDTools uses the UCSC Genome Browser's internal database convention of making the start position 0-based and the end position 1-based. paraclu only needs a start coordinate per site; for reads mapping to the negative strand it is unclear how this should be represented, so I use the end coordinate minus one, maintaining the convention that start positions are 0-based:
my_awk='{OFS="\t"; if($6=="+") print $1,$6,$2,1; else print $1,$6,$3-1,1}'
parallel --verbose "bamToBed -i {} | awk '$my_awk' > {.}.input" ::: *.bam
cat wg*.input > cage.input
At this stage the input file is not sorted and not summarised, i.e. each line represents a single read. We can use the groupBy tool from BEDTools to sum up reads that start at the same position and on the same strand:
cat cage.input | sort -k1,1 -k2,2 -k3,3n | groupBy -g 1,2,3 -c 4 -o sum > cage_sum.input
rm cage.input
Now let's run the read clustering step:
paraclu-9/paraclu 10 cage_sum.input > cage_sum.output

#check out the results
head cage_sum.output
# sequence, strand, start, end, sites, sum of values, min d, max d
chr1 - 10377 249240488 484668 7106296 -1e+100 0.000142
chr1 - 10377 249212275 484664 7106292 0.000142 0.000227
chr1 - 14774 249212275 484663 7106291 0.000227 0.000503
chr1 - 14774 249200349 484658 7106285 0.000503 0.0011
chr1 - 14774 249170214 484631 7106252 0.0011 0.00133
chr1 - 14774 249167959 484628 7106249 0.00133 0.00155
chr1 - 14774 249153809 484612 7106227 0.00155 0.00347
chr1 - 15927 249153809 484608 7106223 0.00347 0.00509
chr1 - 15927 247495895 482521 7097789 0.00509 0.00581
If we examine the first line, we can see a single cluster that encapsulates all the signal on chromosome one. This is not very useful for studying the internal structure of the data set. The paraclu zip file includes a shell script, paraclu-cut.sh, which can help simplify the results.
head paraclu-cut.sh
#! /bin/sh
# Copyright 2011, 2012 Martin C. Frith
# This script simplifies the output of paraclu. It omits clusters
# that are too long, or are singletons. Then, it removes clusters
# whose fold-increase in density is too small. Finally, it removes
# clusters that are contained in larger clusters.
# This script assumes that the input is sorted by start position
With the script's default settings, clusters are removed if they are longer than 200 nt, if max d / min d < 2, or if they are contained within another cluster. I made a test case for each condition:
#removed because max d / min d < 2
chr1 - 10377 10380 484668 7106296 2.01 4
#not removed
chr1 - 20000 20080 484664 7106292 2 4
#removed because it is longer than 200
chr1 - 20000 20201 484668 7106296 2.01 4
#not removed
chr1 - 30000 30080 484664 7106292 2 4
#removed because it is contained within the cluster above
chr1 - 30020 30060 484664 7106292 2 4
Now if we run the script, only two of the results will remain:
paraclu-9/paraclu-cut.sh test.output
chr1 - 20000 20080 484664 7106292 2 4
chr1 - 30000 30080 484664 7106292 2 4
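For the length and fold-change filters, a one-line awk command gives a rough equivalent (my own sketch, not part of paraclu; it recreates the five test cases above so it runs stand-alone, and it does not handle nested-cluster removal, which needs the extra pass that paraclu-cut.sh performs):

```shell
# recreate the five test cases
cat << 'EOF' > test.output
chr1 - 10377 10380 484668 7106296 2.01 4
chr1 - 20000 20080 484664 7106292 2 4
chr1 - 20000 20201 484668 7106296 2.01 4
chr1 - 30000 30080 484664 7106292 2 4
chr1 - 30020 30060 484664 7106292 2 4
EOF

# keep clusters no longer than 200 nt with max d / min d >= 2
# (columns: chr, strand, start, end, sites, sum, min d, max d)
awk '($4 - $3) <= 200 && $8 / $7 >= 2' test.output
```

This keeps three lines: the two survivors shown above plus the nested cluster, which only the containment pass in paraclu-cut.sh removes.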
We can run this script on our results:
paraclu-9/paraclu-cut.sh cage_sum.output > cage_sum_simplified.output
head cage_sum_simplified.output
chr1 - 16330 16332 2 23 0.113 0.5
chr1 - 17469 17497 7 12 0.0291 0.0952
chr1 - 19765 19809 11 15 0.0173 0.0833
chr1 - 21570 21615 11 24 0.0133 0.143
chr1 - 29262 29392 57 8171 0.0161 0.143
chr1 - 88202 88385 11 13 0.0154 0.0385
chr1 - 564518 564519 2 32 1.44 14
chr1 - 564532 564533 2 12 1.44 4
chr1 - 564644 564645 2 11 3 5
chr1 - 564686 564802 79 1686 4.86 6
#convert the paraclu output file into bed
cat cage_sum_simplified.output | awk 'OFS="\t" {print $1, $3, $4, $1"_"$3"_"$4"_"$2, $6, $2}' | sort -k1,1V -k2,2n -k6,6 > cage_sum_simplified.bed
Note that the script removes clusters nested within other clusters, which may be something you want to preserve. Also, what we're missing here is the contribution of reads from each individual library to each cluster; I tested various approaches using BEDTools, but in the end it was easier to use some scripts that I wrote.
#!/usr/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <infile.input> <infile.bed>\n";
my $infile = shift or die $usage;
my $bed = shift or die $usage;
my $bin_size = 200;
my %bed = ();
open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
chomp;
my ($chr, $start, $end, $id, $score, $strand) = split(/\t/);
my $which_bin = int($start / $bin_size);
#print "$which_bin\t$_\n";
my $key = join(":", $chr, $start, $end, $id, $score, $strand);
$bed{$chr}{$strand}{$which_bin}{$key} = 1;
}
close(IN);
my %result = ();
open(IN,'<',$infile) || die "Could not open $infile; $!\n";
while(<IN>){
chomp;
#chr1 + 10088 1
my ($c, $str, $sta, $count) = split(/\t/);
my $which_bin = int($sta / $bin_size);
#check bin before and after
#incase the coordinates span a bin
for ($which_bin - 1 .. $which_bin + 1){
my $current_bin = $_;
if (exists $bed{$c}{$str}{$current_bin}){
foreach my $key (keys %{ $bed{$c}{$str}{$current_bin} }){
my ($chr, $start, $end, $id, $score, $strand) = split(/:/,$key);
if ($sta >= $start && $sta <= $end){
if (exists $result{$key}){
$result{$key}++;
} else {
$result{$key} = 1;
}
}
}
}
}
}
close(IN);
foreach my $key (keys %result){
my ($chr, $start, $end, $id, $score, $strand) = split(/:/,$key);
print join("\t", $chr, $start, $end, 0, $score, $strand, $result{$key}),"\n";
}
exit(0);
__END__
Now let's run the script:
parallel --verbose "count.pl {} cage_sum_simplified.bed > {.}.count" ::: wg*.input
The script below joins the counts into one file:
#!/usr/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <infile.bed> <directory>\n";
my $bed = shift or die $usage;
my $dir = shift or die $usage;
my %count = ();
my @file = ();
opendir(DIR,$dir) || die "Could not open $dir: $!\n";
while(my $infile = readdir(DIR)){
next unless $infile =~ /\.count$/;
push(@file, $infile);
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
chomp;
#chr9 132655887 132655891 0 11 - 2
my ($chr, $start, $end, $junk, $count, $strand, $tally) = split('\t');
my $id = "${chr}_${start}_${end}_${strand}";
$count{$id}{$infile} = $tally;
}
close(IN);
}
closedir(DIR);
#header
print join("\t",'chr','start','end','id','score','strand',@file),"\n";
open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
chomp;
my ($chr, $start, $end, $junk, $count, $strand) = split('\t');
my $id = "${chr}_${start}_${end}_${strand}";
my $line_to_print = "$_\t";
my $total = 0;
if (exists $count{$id}){
foreach my $infile (@file){
if (exists $count{$id}{$infile}){
$line_to_print .= "$count{$id}{$infile}\t";
$total += $count{$id}{$infile};
} else {
$line_to_print .= "0\t";
}
}
} else {
die "Error missing count for $id\n";
}
$line_to_print =~ s/\t$//;
#checkpoint
die unless $total == $count;
print "$line_to_print\n";
}
close(IN);
exit(0);
Joining all the counts:
tally.pl cage_sum_simplified.bed . > cage_sum_simplified.tally
#voila
head cage_sum_simplified.tally
chr start end id score strand wgEncodeRikenCageHelas3CellPapAlnRep1.count wgEncodeRikenCageHelas3CellPapAlnRep2.count wgEncodeRikenCageHelas3CytosolPapAlnRep1.count wgEncodeRikenCageHelas3CytosolPapAlnRep2.count wgEncodeRikenCageHelas3NucleusPapAlnRep1.count wgEncodeRikenCageHelas3NucleusPapAlnRep2.count
chr1 16330 16332 chr1_16330_16332_- 23 - 0 1 0 0 5 17
chr1 17469 17497 chr1_17469_17497_- 12 - 1 0 0 0 7 4
chr1 19765 19809 chr1_19765_19809_- 15 - 2 1 0 1 0 11
chr1 21570 21615 chr1_21570_21615_- 24 - 3 3 0 0 1 17
chr1 29262 29392 chr1_29262_29392_- 8171 - 723 850 414 702 2121 3361
chr1 88202 88385 chr1_88202_88385_- 13 - 2 1 0 1 5 4
chr1 534298 534305 chr1_534298_534305_+ 16 + 1 2 2 1 7 3
chr1 564466 564468 chr1_564466_564468_+ 152 + 77 28 27 14 4 2
chr1 564518 564519 chr1_564518_564519_- 32 - 17 6 7 2 0 0
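With the tally in hand it is easy to ask, for example, how many reads each library contributed overall. A small sketch of my own, using three of the rows above as sample data (columns 7 onward hold the per-library counts):

```shell
# three sample rows from the tally (header omitted)
cat << 'EOF' > tally_sample.txt
chr1 16330 16332 chr1_16330_16332_- 23 - 0 1 0 0 5 17
chr1 17469 17497 chr1_17469_17497_- 12 - 1 0 0 0 7 4
chr1 29262 29392 chr1_29262_29392_- 8171 - 723 850 414 702 2121 3361
EOF

# sum each per-library column (7 onward)
awk '{ for (i = 7; i <= NF; i++) s[i] += $i; n = NF }
     END { for (i = 7; i <= n; i++) printf "library %d: %d\n", i - 6, s[i] }' tally_sample.txt
```

On the real tally, skip the header line (e.g. with NR > 1) before summing.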
Finding the peak:
#!/usr/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <paraclu.input> <file.bed>\n";
my $infile = shift or die $usage;
my $bed = shift or die $usage;
my %store = ();
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
chomp;
#chr1 - 16605 27
my ($chr, $strand, $start, $count) = split(/\t/);
$store{$strand}{$chr}{$start} = $count;
}
close(IN);
open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
chomp;
my ($chr, $start, $end, $junk, $count, $strand) = split('\t');
my $max = 0;
my @pos = ();
for ($start .. $end){
if (exists $store{$strand}{$chr}{$_}){
my $count = $store{$strand}{$chr}{$_};
if ($count > $max){
$max = $count;
@pos = ();
$pos[0] = $_;
} elsif ($count == $max){
push(@pos, $_);
}
}
}
my $peak = join(',', @pos);
print join("\t", $_, $max, $peak),"\n";
}
close(IN);
exit(0);
Running the code:
peak.pl cage_sum.input cage_sum_simplified.bed | head
chr1 16330 16332 chr1_16330_16332_- 23 - 22 16330
chr1 17469 17497 chr1_17469_17497_- 12 - 4 17496
chr1 19765 19809 chr1_19765_19809_- 15 - 3 19780
chr1 21570 21615 chr1_21570_21615_- 24 - 9 21577
chr1 29262 29392 chr1_29262_29392_- 8171 - 3733 29347
chr1 88202 88385 chr1_88202_88385_- 13 - 2 88384,88385
chr1 534298 534305 chr1_534298_534305_+ 16 + 6 534301
chr1 564466 564468 chr1_564466_564468_+ 152 + 117 564467
chr1 564518 564519 chr1_564518_564519_- 32 - 18 564519
chr1 564532 564533 chr1_564532_564533_- 12 - 8 564533
Code for this post is available at https://github.com/davetang/paraclu_prep.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi Dave,
I wanted to convert the BAM files into CTSS.bed files, where i am interested only on the first-base (like FANTOM5) instead of the whole tag-length (like the current ENCODE files).
This is how i did
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CellPapAlnRep1.bam
bamToBed -i wgEncodeRikenCageHelas3CellPapAlnRep1.bam > wgEncodeRikenCageHelas3CellPapAlnRep1.bed
# BED- 0-based start and 1-based end
cat wgEncodeRikenCageHelas3CellPapAlnRep1.bed | awk 'BEGIN {OFS="\t"} { if ($6 == "+") { print $1,$2,$2+1,"HelaS3Cell",$5,$6 } else if ($6 == "-") { print $1,$3-1,$3,"HelaS3Cell",$5,$6 } }' \
| awk 'BEGIN {OFS="\t"} { A[$1":"$2":"$3":"$4":"$5":"$6]++ } END { for(i in A) { print i"\t"A[i] }}' | sed 's/:/\t/g' | sort -k1,1 -k2,2g \
| awk 'BEGIN {OFS="\t"} { print $1,$2,$3,$4,"Ctss"FNR,$6,$7 }' > wgEncodeRikenCageHelas3CellPapAlnRep1_Ctss.bed
What do you say of it?
I can see you have converted both TSS into 0-based, before inputting into paraclu.
Thanks !
The first bit looks fine. I’m not sure what the second bit (from the awk BEGIN) is trying to achieve though.
Hi Dave,
I tried to download the paraclu using wget http://www.cbrc.jp/paraclu/paraclu-9.zip,
but generates error:
Connecting to www.cbrc.jp|153.126.155.15|:443... connected.
ERROR: certificate common name "www.ncrna.org" doesn't match requested host name "www.cbrc.jp".
To connect to www.cbrc.jp insecurely, use '--no-check-certificate'.
When I tried:
wget --no-check-certificate http://www.cbrc.jp/paraclu/paraclu-9.zip
Only an index.html file is downloaded.
Do you know how to download the paraclu now?
Thanks a lot!
Here you go: