Using the ENCODE ChIA-PET dataset

Updated: 2014 March 14th

From the Wikipedia article:

Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) is a technique that incorporates chromatin immunoprecipitation (ChIP)-based enrichment, chromatin proximity ligation, Paired-End Tags, and High-throughput sequencing to determine de novo long-range chromatin interactions genome-wide.

Let's get started on using the ENCODE ChIA-PET dataset by downloading the bed files, which contain the interactions:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetK562Pol2InteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetK562Pol2InteractionsRep2.bed.gz

#get others if you want
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetHct116Pol2InteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetHelas3Pol2InteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetK562CtcfInteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7CtcfInteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7CtcfInteractionsRep2.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7EraaInteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7EraaInteractionsRep2.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7EraaInteractionsRep3.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7Pol2InteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7Pol2InteractionsRep2.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7Pol2InteractionsRep3.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7Pol2InteractionsRep4.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetNb4Pol2InteractionsRep1.bed.gz
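The file names above all follow the same pattern (cell line, then factor, then replicate number), so the URLs can be generated rather than typed out. This is only a sketch: chiapet_url is a hypothetical helper name, and it assumes the pattern seen in the wget commands above holds for every file you ask for.

```shell
# chiapet_url: hypothetical helper that rebuilds a download URL from the
# naming pattern seen in the wget commands above.
# Usage: chiapet_url <CellLine> <Factor> <ReplicateNumber>
chiapet_url() {
  cell=$1; factor=$2; rep=$3
  base=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet
  printf '%s/wgEncodeGisChiaPet%s%sInteractionsRep%s.bed.gz\n' \
    "$base" "$cell" "$factor" "$rep"
}

# Print the URLs for the four MCF-7 Pol2 replicates, one per line.
for rep in 1 2 3 4; do
  chiapet_url Mcf7 Pol2 "$rep"
done
```

Piping that loop into wget -i - would fetch the files, since wget reads URLs from standard input when the input file is given as a dash.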

For a preview and description of these interaction bed files, have a look at the table schema, which has the definition:

ChIA-PET Chromatin Interaction PET clusters: Two different genomic regions in the chromatin are genomically far from each other or in different chromosomes, but are spatially close to each other in the nucleus and interact with each other for regulatory functions. BED12 format is used to represent the data.
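In these files the fourth (name) column carries both interacting regions in one string, for example chr1:875400..877590-chr8:126185955..126188744,2 (taken from the sample line quoted in the Perl script further down). The sketch below splits that string into separate columns so the structure is easier to see. split_interaction_name is a hypothetical helper name, and treating the trailing number as a count for the cluster is an assumption; the table schema is the place to confirm what it means.

```shell
# split_interaction_name: hypothetical helper, not part of the original post.
# Reads BED12 lines on stdin and prints, tab-separated:
#   chrA startA endA chrB startB endB trailing_number
split_interaction_name() {
  awk 'BEGIN { OFS = "\t" }
  {
    # $4 looks like chr1:875400..877590-chr8:126185955..126188744,2
    # Split on ":", ",", "-" or ".." to get the seven pieces.
    n = split($4, part, /[:,-]|\.\./)
    if (n != 7) { print "Could not parse " $4 > "/dev/stderr"; next }
    print part[1], part[2], part[3], part[4], part[5], part[6], part[7]
  }'
}
```

For a quick look at a downloaded file, something like zcat wgEncodeGisChiaPetK562Pol2InteractionsRep1.bed.gz | head | split_interaction_name should work without decompressing it on disk.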

One cool way of visualising these chromatin interactions is with Circos (instructions for installing Circos). Here's a simple Perl script that parses the bed12 files and writes them in a format readable by Circos:

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;

my %check = ();

open(IN,'<', $infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   #chr1    875400  877590  chr1:875400..877590-chr8:126185955..126188744,2 200     .       875400  877590  255,0,0 1       2190    0
   my($chr, $start, $end, $name, @rest) = split();
   if ($name =~ /^(chr[0-9xym]+):(\d+)\.\.(\d+)-(chr[0-9xym]+):(\d+)\.\.(\d+),\d+/i){

      #inter-chromosomal interactions are on two lines
      #skip the duplicated line
      if (exists $check{$name}){
         next;
      } else {
         $check{$name} = 1;
      }

      my $chr_first = $1;
      my $start_first = $2;
      my $end_first = $3;
      my $chr_second = $4;
      my $start_second = $5;
      my $end_second = $6;
      $chr_second =~ s/chr/hs/;
      $chr_first =~ s/chr/hs/;
      $chr_first = lc($chr_first);
      $chr_second = lc($chr_second);
      print join (" ", $chr_first, $start_first, $end_first, $chr_second, $start_second, $end_second),"\n";
   } else {
      die "Could not parse $name\n";
   }
}
close(IN);
exit(0);

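Now to execute the script, saved here as to_circos.pl and made executable, on the decompressed bed file (run gunzip on the downloaded .bed.gz first):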

./to_circos.pl wgEncodeGisChiaPetK562Pol2InteractionsRep1.bed > wgEncodeGisChiaPetK562Pol2InteractionsRep1.link
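Before handing the link file to Circos, a quick sanity check can catch a truncated or mis-parsed conversion. The sketch below is only that: check_links is a hypothetical helper name, and the rules it applies (six fields, hs-prefixed chromosome names, start not greater than end) are what the Perl script above should produce from BED input, not a statement of what Circos itself requires.

```shell
# check_links: hypothetical helper; reports the first suspicious line, if any.
# A line is accepted when it has six fields, both chromosome names start
# with "hs", and each start coordinate is not greater than its end.
check_links() {
  awk '
    NF != 6 || $1 !~ /^hs/ || $4 !~ /^hs/ || $2+0 > $3+0 || $5+0 > $6+0 {
      print "line " NR " looks wrong: " $0
      bad = 1
      exit            # jump to END, which sets the exit status
    }
    END {
      if (!bad) print NR " links look fine"
      exit bad + 0    # 0 when every line passed, 1 otherwise
    }
  '
}
```

It reads from standard input, so check_links < wgEncodeGisChiaPetK562Pol2InteractionsRep1.link is the intended usage.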

I have these three files prepared for Circos (ideogram.conf, ticks.conf and test.conf). See my getting started with Circos post to get more information on these configuration files.

#ideogram.conf
cat ideogram.conf
<ideogram>

<spacing>
default = 0.005r
</spacing>

# Ideogram position, fill and outline
radius           = 0.90r
thickness        = 20p
fill             = yes
stroke_color     = dgrey
stroke_thickness = 2p

# Minimum definition for ideogram labels.

show_label       = yes
# see etc/fonts.conf for list of font names
label_font       = default
label_radius     = dims(image,radius) - 60p
label_size       = 30
label_parallel   = yes
</ideogram>

#ticks.conf
cat ticks.conf
show_ticks          = yes
show_tick_labels    = yes

<ticks>
radius           = 1r
color            = black
thickness        = 2p

# the tick label is derived by multiplying the tick position
# by 'multiplier' and casting it in 'format':
#
# sprintf(format,position*multiplier)
#

multiplier       = 1e-6

# %d   - integer
# %f   - float
# %.1f - float with one decimal
# %.2f - float with two decimals
#
# for other formats, see http://perldoc.perl.org/functions/sprintf.html

format           = %d

<tick>
spacing        = 5u
size           = 10p
</tick>

<tick>
spacing        = 25u
size           = 15p
show_label     = yes
label_size     = 20p
label_offset   = 10p
format         = %d
</tick>

</ticks>

#test.conf
cat test.conf  | grep -v "^#" | grep -v "^$"
karyotype = data/karyotype/karyotype.human.txt
chromosomes_units = 1000000
<links>
<link>
file          = wgEncodeGisChiaPetK562Pol2InteractionsRep1.link
radius        = 0.8r
bezier_radius = 0r
color         = black_a4
thickness     = 2
<rules>
<rule>
condition     = var(intrachr)
show          = no
</rule>
<rule>
condition     = 1
color         = eval(var(chr2))
flow          = continue
</rule>
<rule>
condition     = to(hs1)
radius2       = 0.99r
</rule>
</rules>
</link>
</links>
<<include ideogram.conf>>
<<include ticks.conf>>
<image>
<<include etc/image.conf>>
</image>
<<include etc/colors_fonts_patterns.conf>>
<<include etc/housekeeping.conf>>

#now run Circos assuming you have the link file in the same directory as the conf files
bin/circos -conf test.conf

If everything worked perfectly, you should get this image:

[Image: circos]

There's a whole new level of complexity when we take the spatial organisation of chromosomes into account as well.

Intra-chromosomal

If we want to focus on chromosome one and show long-range interactions (over 1 Mb):

#how many intra-chromosomal interactions on chromosome 1
cat wgEncodeGisChiaPetK562Pol2InteractionsRep1.link | awk '$1=="hs1" && $4=="hs1" && $5-$2>1000000 {print}' | wc -l
164
cat wgEncodeGisChiaPetK562Pol2InteractionsRep1.link | awk '$1=="hs1" && $4=="hs1" && $5-$2>1000000 {print}' > chr1_to_chr1.link
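The filter above measures distance as $5-$2, which is only positive when the second region lies downstream of the first. The post does not establish whether that ordering always holds in these files, so the sketch below uses the absolute distance and takes the chromosome and threshold as arguments. long_range is a hypothetical helper name, and its counts may differ from the 164 reported above if any links are stored in the opposite order.

```shell
# long_range: hypothetical helper.
# Usage: long_range <chromosome> <min_distance> < file.link
# Prints intra-chromosomal links whose start coordinates are more than
# <min_distance> bases apart, whichever region comes first.
long_range() {
  awk -v chr="$1" -v min="$2" '
    $1 == chr && $4 == chr {
      d = $5 - $2
      if (d < 0) d = -d        # distance regardless of anchor order
      if (d > min) print
    }'
}
```

For example, long_range hs1 1000000 < wgEncodeGisChiaPetK562Pol2InteractionsRep1.link > chr1_to_chr1.link would produce the input for the plot below, and changing the first argument gives the same view for another chromosome.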

The ticks.conf and ideogram.conf are the same. Here's what the test.conf file looks like:

cat test.conf 
karyotype = data/karyotype/karyotype.human.txt
chromosomes_units = 1000000

chromosomes_display_default = no
chromosomes                 = hs1

<links>
<link>
file          = chr1_to_chr1.link
radius        = 0.8r
bezier_radius = 0r
color         = black_a4
thickness     = 2
<rules>
<rule>
condition     = var(intrachr)
show          = yes
</rule>
<rule>
condition     = 1
color         = eval(var(chr2))
flow          = continue
</rule>
<rule>
condition     = to(hs1)
radius2       = 0.99r
</rule>
</rules>
</link>
</links>
<<include ideogram.conf>>
<<include ticks.conf>>
<image>
<<include etc/image.conf>>
</image>
<<include etc/colors_fonts_patterns.conf>>
<<include etc/housekeeping.conf>>

#run circos
circos -conf test.conf

[Image: circos]

Long-range intra-chromosomal interactions on chromosome one.

Conclusions

I've shown a way of visualising the ChIA-PET dataset, but not a way of using it. One way I intend to use this dataset is to modify the Perl script above to produce a bed file containing the genomic loci that interact with other loci. Then, just to get an idea of what these regions encompass, I would intersect them with a genome annotation file.
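As a rough sketch of that idea, the conversion can also start from the .link file rather than from a modified Perl script. links_to_bed is a hypothetical helper name and the four-column output layout is just one possible choice. Note that the Perl script above lowercases chromosome names, so chrX would come back as chrx; adjust that if your annotation file uses chrX.

```shell
# links_to_bed: hypothetical helper. Reads a .link file on stdin and writes
# one BED line per interacting region, with the partner region in the name
# column so each locus can be traced back to what it interacts with.
links_to_bed() {
  awk 'BEGIN { OFS = "\t" }
  {
    a = $1; b = $4
    # Undo the chr -> hs renaming done for Circos.
    sub(/^hs/, "chr", a); sub(/^hs/, "chr", b)
    print a, $2, $3, b ":" $5 "-" $6
    print b, $5, $6, a ":" $2 "-" $3
  }'
}
```

The output of links_to_bed < wgEncodeGisChiaPetK562Pol2InteractionsRep1.link could then be passed to an interval tool such as bedtools intersect together with an annotation bed file of your choosing.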




Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Comments
  1. Hi, nice post :)
    I think there is a small problem in your script.
    In the bed file, an interaction is repeated on two lines if it is inter-chromosomal, but written on one line if it is intra-chromosomal.
    So you’ll end up with a lot of duplicated interactions.
    You could add a test: if both regions are on the same chromosome, go on to the next line; otherwise skip the next line.

    1. Hi Nadhi,

      You’re right; I was a bit sloppy there.

      I adjusted the code to print out only one line for inter-chromosomal interactions.

      Thanks for letting me know 🙂

      Cheers,

      Dave

  2. Hi Dave,
    I solved the problem…thanks for your help.
    However, I would like to know how to change the colour to red or any other bright colour in the script for detecting inter-chromosomal interactions.

    regards,
    Amit.

  3. Hi Dave,
    Thank you for the post, I am facing a problem which is as follows;
    I used the first script to convert .bed file to .link file and after that I created the three files ideogram.conf(line 1 to 24), ticks.conf(line 25 to 67) and test.conf(line 69 to 103). After this I ran the command circos -conf test.conf but I got an error which was like this;
    debuggroup summary 0.13s welcome to circos v0.64 2 May 2013
    debuggroup summary 0.13s loading configuration from file test.conf
    debuggroup summary 0.13s found conf file test.conf

    *** CIRCOS ERROR ***

    CONFIGURATION FILE ERROR

    Error parsing the configuration file. You used an <<include FILE>> directive,
    but the FILE could not be found. This FILE is interpreted relative to the
    configuration file in which the <<include>> directive is used. Circos lookd
    for the file in these directories

    /etc/circos

    .

    ./etc

    /usr/bin/etc

    /usr/bin/../etc

    /usr/bin/..

    /usr/bin

    The Config::General module reported the error

    Config::General The file “etc/image.conf” does not exist within ConfigPath:
    /etc/circos…./etc./usr/bin/etc./usr/bin/../etc./usr/bin/…/usr/bin! at
    /usr/share/perl5/Circos/Configuration.pm line 707.

    If you are having trouble debugging this error, use this tutorial to learn how
    to use the debugging facility

    http://www.circos.ca/tutorials/lessons/configuration/debugging

    If you’re still stumped, get support in the Circos Google Group

    http://groups.google.com/group/circos-data-visualization

    Stack trace:
    at /usr/share/perl5/Circos/Error.pm line 354.
    Circos::Error::fatal_error(‘configuration’, ‘cannot_find_include’, ‘/etc/circos\x{a}.\x{a}./etc\x{a}/usr/bin/etc\x{a}/usr/bin/../etc\x{a}/usr/bin/..\x{a}…’, ‘Config::General The file “etc/image.conf” does not exist with…’) called at /usr/share/perl5/Circos/Configuration.pm line 719
    Circos::Configuration::loadconfiguration(‘test.conf’) called at /usr/share/perl5/Circos.pm line 197
    Circos::run(‘Circos’, ‘configfile’, ‘test.conf’) called at /usr/bin/circos line 300
    Please can you tell me where I went wrong. I could run circos using the file in example folder.

    1. Hi Akash,

      it seems that Circos could not find some file.

      Error parsing the configuration file. You used an <<include FILE>> directive,
      but the FILE could not be found. This FILE is interpreted relative to the
      configuration file in which the <<include>> directive is used.

      I’m not sure but try looking for unmatched <>‘s in your configuration file.

    2. Hi Akash,

      Don’t know whether you have solved the problem or not. I met the same problem the first time I tried. I think you also misunderstood the codes provided above. The contents of “ideogram.conf” actually should be line 3-24, “ticks.conf” line 28-67 and “test.conf” line 70-106. That “cat ×××” should be a command run by Dave. lol

  4. Trying to run circos but getting this error, Can you help me with this Dave?
    circos -conf test.conf (after running this command)
    *** CIRCOS ERROR ***

    CONFIGURATION FILE ERROR

    …error text from [error/configuration.missing.txt] could not be read…
