ENCODE RNA polymerase II ChIP-seq

Chromatin immunoprecipitation sequencing (ChIP-seq) is a high throughput method for investigating protein-DNA interactions and aims to determine whether specific proteins are interacting with specific genomic loci. The workflow consists of crosslinking DNA and protein together, usually via the use of formaldehyde, which induces protein-DNA and protein-protein crosslinks. Importantly, these crosslinks are reversible by incubation at 70°C. Next the crosslinked DNA-protein complexes are sheared into roughly 500 bp fragments, usually by sonication. At this point we have "sheared DNA" and "sheared DNA crosslinked with proteins". Now comes the immunoprecipitation step, which is a technique that precipitates a protein antigen out of solution using an antibody that recognises that particular antigen. The crosslinking would result in many DNA-protein interactions and we use immunoprecipitation to pull down the protein of interest with the DNA region it was interacting with. After immunoprecipitation, the formaldehyde crosslinks are reversed by heating and the DNA strands are purified and sequenced. There's a nice graphic depicting this workflow at the Wikipedia article for ChIP-seq.

Continue reading

ENCODE mappability and repeats

The ENCODE mappability tracks can be visualised on the UCSC Genome Browser and they provide a sense of how mappable a region of the genome is in terms of short reads or k-mers. On a side note, it seems some people use "mapability" and some use "mappability"; I was taught the CVC rule, so I'm going to stick with "mappability".

There are two sets of tracks, alignability and uniqueness, but what's the difference between the two? From the UCSC Genome Browser:

Alignability - These tracks provide a measure of how often the sequence found at the particular location will align within the whole genome. Unlike measures of uniqueness, alignability will tolerate up to 2 mismatches. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Uniqueness - These tracks are a direct measure of sequence uniqueness throughout the reference genome. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Let's take a look at two examples, where I try to use k-mers of similar size:

#download the files
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityUniqueness35bp.bigWig
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign36mer.bigWig
#bigWigToBedGraph can be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/
#convert into flat format
bigWigToBedGraph wgEncodeDukeMapabilityUniqueness35bp.bigWig wgEncodeDukeMapabilityUniqueness35bp.bedGraph
bigWigToBedGraph wgEncodeCrgMapabilityAlign36mer.bigWig wgEncodeCrgMapabilityAlign36mer.bedGraph
head wgEncodeDukeMapabilityUniqueness35bp.bedGraph
chr1    0       10145   0
chr1    10145   10160   1
chr1    10160   10170   0.5
chr1    10170   10175   0.333333
chr1    10175   10177   0.5
chr1    10177   10215   0
chr1    10215   10224   1
chr1    10224   10229   0.5
chr1    10229   10248   1
chr1    10248   10274   0
head wgEncodeCrgMapabilityAlign36mer.bedGraph
chr1    10000   10078   0.0013624
chr1    10078   10081   0.0238095
chr1    10081   10088   0.0185185
chr1    10088   10089   0.0147059
chr1    10089   10096   0.0185185
chr1    10096   10097   0.0238095
chr1    10097   10099   0.0185185
chr1    10099   10102   0.00172712
chr1    10102   10120   0.0013624
chr1    10120   10121   0.00172712

The bedGraph format simply associates a score (fourth column) to a particular region, as defined in the first three columns. At first glance you can already see the difference in the precision of the scores. The scores were calculated the same way, 1/the number of places a 35/36mer maps to the genome; however the uniqueness track will only keep scores for reads that map up to 4 places (0.25). So according to mappability track, the region chr1:10000-10035 will map to 734 places (since 1/734 =~ .0013624).

Continue reading

Using the ENCODE ChIA-PET dataset

Updated: 2014 March 14th

From the Wikipedia article:

Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) is a technique that incorporates chromatin immunoprecipitation (ChIP)-based enrichment, chromatin proximity ligation, Paired-End Tags, and High-throughput sequencing to determine de novo long-range chromatin interactions genome-wide.

Let's get started on using the ENCODE ChIA-PET dataset by downloading the bed files, which has the interactions:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetK562Pol2InteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetK562Pol2InteractionsRep2.bed.gz

#get others if you want
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetHct116Pol2InteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetHelas3Pol2InteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetK562CtcfInteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7CtcfInteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7CtcfInteractionsRep2.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7EraaInteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7EraaInteractionsRep2.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7EraaInteractionsRep3.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7Pol2InteractionsRep1.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7Pol2InteractionsRep2.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7Pol2InteractionsRep3.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetMcf7Pol2InteractionsRep4.bed.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGisChiaPet/wgEncodeGisChiaPetNb4Pol2InteractionsRep1.bed.gz

For a preview and description of these interaction bed files, have a look at the table schema, which has the definition:

ChIA-PET Chromatin Interaction PET clusters: Two different genomic regions in the chromatin are genomically far from each other or in different chromosomes, but are spatially close to each other in the nucleus and interact with each other for regulatory functions. BED12 format is used to represent the data.

One cool way of visualising these chromtain interactions would be by using Circos (instructions for installing circos). Here's a simple Perl script to parse the bed12 files and prepare them in a format readable by Circos:

Continue reading

Using ENCODE methylation data

Post updated 2014 January 2nd

DNA methylation is a biochemical process and epigenetic modification, whereby a methyl group is added to the cytosine nucleotide (and also adenine) to form 5-methylcytosine. DNA methylation at the 5' position of cytosine has the specific effect of reducing gene transcription and typically occurs in CpG sites, which are regions of DNA where a cytosine occurs next to a guanine (the p stands for the phosphate link that links them together). Most CpGs are methylated in mammals, however, unmethylated CpGs are often grouped in clusters called CpG islands, which are present in the 5' regulatory regions of many genes.

One way of profiling methylation patterns in DNA is via the use of bisulphite treatment of DNA following by sequencing, which is known as bisulphite sequencing. The treatment of DNA with bisulphite converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected. Therefore after sequencing, cytosine residues represent methylated cytosines in the genome. One variant of bisulphite sequencing, is reduced representation bisulphite sequencing (RRBS), which was developed as a cost-efficient method to profile areas of the genome that have a high CpG content. Genomic DNA is digested using the restriction endonuclease MspI, which recognises the sequence 5'-CCGG-3'. MspI is actually part of a isoschizomer pair with HpaII, which are restriction enzymes that are specific to the same recognition sequence. However MspI can recognise methylated cytosines, whereby HpaII cannot.

     MspI site               MspI site
5'-...CCGGNNNNNNNNNNNNNNNNNNNNNCCGG...-3'
3'-...GGCCNNNNNNNNNNNNNNNNNNNNNGGCC...-5'

             MspI digestion

5'-    CGGNNNNNNNNNNNNNNNNNNNNNC      -3'
3'-      CNNNNNNNNNNNNNNNNNNNNNGGC    -5'

Continue reading

Genome mapability

I know of the genome mapability and uniqueness tracks provided by the UCSC Genome Browser but I was just interested in doing this myself for the hg19 genome.

As a test, I investigated chr22, where the base composition is broken down as:

Length of chr22 = 51,304,566
A: 9,094,775
C: 8,375,984
G: 8,369,235
T: 9,054,551
Other: 16,410,021

I wrote a script to generate 35mer reads by sliding 1 bp along the chromosome:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.fa>\n";
my $infile = shift or die $usage;

my $read_length = '35';
my $current_accession = '';
my %seq = ();

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   next if /^$/;
   if (/^>(.*)$/){
      $current_accession = $1;
      warn "Storing $current_accession\n";
   } else {
      $seq{$current_accession} .= $_;
   }
}
close(IN);

warn "Finished storing genome\n";

foreach my $chr (sort {$a cmp $b} keys %seq){
   my $length = length($seq{$chr});
   my $end = $length - $read_length;
   for (my $i = 0; $i <= $end; ++$i){
      my $read = substr($seq{$chr},$i,$read_length);
      $read = uc($read);
      next if $read =~ /N/;
      die "$read\n" unless $read =~ /[ATGC]+/;
      my $chr_start = $i + 1;
      my $chr_end = $i + $read_length;
      my $id = $chr . '_' . $chr_start . '_' . $chr_end;
      print ">$id\n$read\n";
   }
}

exit(0);

Any read that contained a "N" was removed. I then mapped these 35mers back to the hg19 genome using BWA, and here are the statistics:

Mapped: 34,894,095
Unmapped: 0
Edit distances
0 34,894,095
Multimap profile
Mapped uniquely 28,615,801
Mapped to 2 locations 1,350,653
Mapped to 3 locations 851,814
Mapped to 4 locations 388,360
Mapped to 5 locations 239,843
Mapped to 6 locations 194,604
Mapped to 7 locations 184,291
Mapped to 8 locations 149,467
Mapped to 9 locations 92,002
Mapped to 10 locations 66,698
Mapped to more than 10 location 2,760,562

28,615,801 / 34,894,095 (~82%) could be mapped uniquely.

Using the GENCODE exon list and intersectBed I characterised the list of uniquely and multi mapping reads.

350,930 / 6,278,294 (5.59%) of the multi mapping reads intersected GENCODE exons.

2,441,765 / 28,615,801 (8.53%) of the uniquely mapped reads intersected GENCODE exons; only a slighter higher percentage than multimappers.

It seems that most of chr22 contains unique stretches of nucleotides. These unique regions are not particularly enriched in exonic sequences in contrast to multimappers. Of the multi mapping reads that intersected exons, it would be interesting to see how many of these map to genes and pseudogenes, especially when we have 1,350,653 reads that mapped to two places in the genome.

And lastly by using genomeCoverageBed and by separating out the uniquely mapping reads, we can create a track that shows the unique parts of the genome.

genomeCoverageBed -ibam chr22_read_sorted_unique.bam -g ~/davetang/chrom_size/hg19_chrom_info.txt -d > result

cat result | awk '$3 != 0 {print}' | head
#chr22   16050002        1
#chr22   16050003        2
#chr22   16050004        3
#chr22   16050005        4
#chr22   16050006        5
#chr22   16050007        6
#chr22   16050008        7
#chr22   16050009        8
#chr22   16050010        9
#chr22   16050011        10

The third column, which shows the number of reads covering this position, is meaningless due to the way I slid across the chromosome. One can just change the values this into 1s and 0s, where 1 is unique and 0 is not unique.

cat result | perl -nle '@a=split; if ($a[2] != 0){ print "$a[0]\t$a[1]\t1" } else {print}'

Lastly a note about the Duke uniqueness tracks provided by the UCSC Genome Browser, which you can download here, since it's almost the same as what I showed above. For the 35mer track, here's part of the track:

chr1 68849 68859 0.333333
chr1 68859 68861 0.5
chr1 68861 68896 1
chr1 68896 69099 0.333333

I checked all the scores (fourth column) and they are either 0, 0.25, 0.333333, 0.5 and 1.

0 = mapped to more than 4 places
0.25 = mapped to 4 places
0.33 = mapped to 3 places
0.5 = mapped to 2 places
1 = mapped to 1 place

The scores are assigned to the first base of the sequence. Using the example above:

chr1 68859 68861 0.5

That should mean that chr1:68859-68894 and chr1:68860-68895 and chr1:68861-68896 all map to two places in the genome. To test this I extracted the sequence at chr1:68859-68894, which is

GAATTATAAGGCATAGCCAGGGAAGTAGTGCGAGAT

And blat returns three hits, with two best hits:

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           36     1    36    36 100.0%    19   +     110447    110482     36
browser details YourSeq           36     1    36    36 100.0%     1   +      68859     68894     36
browser details YourSeq           35     1    35    36 100.0%    15   -  102463460 102463494     35

And another test for chr1:68860-68895, which is this sequence AATTATAAGGCATAGCCAGGGAAGTAGTGCGAGATA and the blat result

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           36     1    36    36 100.0%    19   +     110448    110483     36
browser details YourSeq           36     1    36    36 100.0%     1   +      68860     68895     36
browser details YourSeq           34     1    36    36  97.3%    15   -  102463458 102463493     36

Here's a slide I made for the Duke uniqueness track using 20 bp windows:

GENCODE

By now you should have heard about the ENCODE project. GENCODE, summarised as Encyclopædia of genes and gene variants, is a sub-project of ENCODE where the aim is to annotate all evidence-based gene features in the entire human genome with high accuracy. This includes protein coding genes and their isoforms, non coding RNAs and pseudogenes. The process to create this annotation involves (1) manual curation, (2) different computational analysis and (3) targeted experimental approaches. Putative loci can be verified by wet-lab experiments and computational predictions will be analysed manually.

The GENCODE gene sets have been used by the ENCODE consortium as well as the 1000 Genomes Project as reference gene sets.

For those wishing to download the datasets directly, visit the ftp site.

For each feature there are three different confidence levels and in verbatim from the README:

Level 1: validated

Pseudogene loci, that were predicted by the analysis-pipelines from YALE, UCSC as well as by HAVANA manual annotation from WTSI. Other transcripts, that were verified experimentally by RT-PCR and sequencing.

Level 2: manual annotation

HAVANA manual annotation from WTSI (and Ensembl annotation where it is identical to Havana). The following regions are considered "fully annotated" although they will still be updated: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, ENCODE pilot regions.

Level 3: automated annotation

ENSEMBL loci where they are different from the HAVANA annotation or where no annotation can be found.

In the latest release (13), there are these files (descriptions taken from the header of each respective file):

gencode.v13.2wayconspseudos.gtf.gz - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68) - Yale-UCSC 2-way consensus pseudogenes

gencode.v13.annotation.gtf.gz gz Archive - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68)

gencode.v13.annotation_corrected_statuses.gtf.gz - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68)

gencode.v13.long_noncoding_RNAs.gtf.gz - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68) - long non-coding RNAs

gencode.v13.pc_transcripts.fa.gz - nucleotide fasta file of the transcripts (89,828 fasta header lines)

gencode.v13.pc_translations.fa.gz - amino acid fasta file of the transcripts (89,828 fasta header lines)

gencode.v13.polyAs.gtf.gz - evidence-based annotation of the human genome (GRCh37) - polyA features

gencode.v13.tRNAs.gtf.gz - evidence-based annotation of the human genome (GRCh37) - tRNAs (625 tRNAs)

gencode13_GRCh37.tgz - this tarball contains another layer of files:

gencode.v13.2wayconspseudos.classes
gencode.v13.2wayconspseudos.gtf
gencode.v13.annotation.level_1_2.classes
gencode.v13.annotation.level_1_2.gtf
gencode.v13.annotation.level_3.classes
gencode.v13.annotation.level_3.gtf
gencode.v13.polyAs.classes
gencode.v13.polyAs.gtf

Understanding the gencode.v13.annotation_corrected_statuses.gtf.gz file

For information on GTF files: http://www.gencodegenes.org/gencodeformat.html

There are 2,504,987 lines in this file. The breakdown of the feature types (column 3) in this file are (where the first column shows the tallied number):

 702193 CDS
    103 Selenocysteine
 268425 UTR
1143113 exon
  55123 gene
  80249 start_codon
  72814 stop_codon
 182967 transcript

As you can see above, there are more transcripts than genes; more statistics on the data release can be found here. For example the gene DDX11L1 has one gene entry in the GTF file and several transcripts:

chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "NULL"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "NULL"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "NULL"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "NULL"; transcript_name "DDX11L1-002"; level 2; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 ENSEMBL transcript 11872 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL transcript 11874 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "NULL"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "NULL"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";

Full descriptions of different biotypes are available at http://www.gencodegenes.org/gencode_biotypes.html and can be found in the additional information column (column 9). For example a "processed_transcript" doesn't contain an ORF and a "pseudogene" has homology to proteins but generally suffer from a disrupted coding sequence and an active homologous gene can be found at another locus.

The latest hg19 RefSeq release contains 40,346 entries and the latest UCSC genes release contains 80,922 entries (both downloaded from the UCSC Table Browser). I have been using the RefSeq for many years but it is time to move onto GENCODE.

And lastly here's an anything but elegant script that converts the GTP file into a BED12 file for all transcripts:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.annotation.gtf>\n";
my $infile = shift or die $usage;

if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

my %transcript = ();
my $block_count = '0';
my $block_size = '';
my $block_start = '';
while(<IN>){
   chomp;
   next if (/^#/);
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
   next unless $type eq 'transcript' || $type eq 'exon';
   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';
   foreach my $blah (@annotation){
      my ($type,$name) = split(/\s+/,$blah);
      #if ($type eq 'transcript_name'){
      if ($type eq 'transcript_id'){
         $transcript_id = $name;
         $transcript_id =~ s/"//g;
      }
   }
   if ($transcript_id eq 'none'){
      die "No name for entry $.\n";
   }
   if (exists $transcript{$transcript_id}){
      ++$block_count;
      my $this_block_size = $end - $start + 1;
      my $this_block_start = $start - $transcript{$transcript_id}->{'START'} - 1;
      $block_size .= "$this_block_size,";
      $block_start .= "$this_block_start,";
      if ($strand eq '+'){
         if ($end == $transcript{$transcript_id}->{'END'}){
            $block_count =~ s/,$//;
            $block_size =~ s/,$//;
            $block_start =~ s/,$//;
            print join("\t",$chr,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},$transcript_id,0,$strand,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},'255,0,0',$block_count,$block_size,$block_start),"\n";
            $block_count = '0';
            $block_size = '';
            $block_start = '';
            next;
         }
      } else {
         if ($start - 1 == $transcript{$transcript_id}->{'START'}){
            $block_count =~ s/,$//;
            $block_size =~ s/,$//;
            $block_start =~ s/,$//;
            my @block_start = split(/,/,$block_start);
            my @block_size = split(/,/,$block_size);
            @block_start = reverse(@block_start);
            @block_size = reverse(@block_size);
            $block_start = join(",",@block_start);
            $block_size = join(",",@block_size);
            print join("\t",$chr,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},$transcript_id,0,$strand,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},'255,0,0',$block_count,$block_size,$block_start),"\n";
            $block_count = '0';
            $block_size = '';
            $block_start = '';
            next;
         }
      }
   } else {
      $transcript{$transcript_id}->{'START'} = $start - 1;
      $transcript{$transcript_id}->{'END'} = $end;
   }
}
close(IN);

exit(0);

I could successfully upload the BED file to the UCSC Genome Browser and visually double checked a handful of transcripts and they looked exactly the same as the GENCODE v12 transcript models available on the UCSC Genome Browser.

Encyclopedia of DNA elements (ENCODE)

ENCODE is an abbreviation of "Encyclopedia of DNA elements", a project that aims to discover and define the functional elements encoded in the human genome via multiple technologies. Used in this context, the term "functional elements" is used to denote a discrete region of the genome that encodes a defined product (e.g., protein) or a reproducible biochemical signature, such as transcription or a specific chromatin structure. It is now widely appreciated that such signatures, either alone or in combination, mark genomic sequences with important functions, including exons, sites of RNA processing, and transcriptional regulatory elements such as promoters, enhancers, silencers, and insulators.

The data included from the ENCODE project include:

  1. Identification and quantification of RNA species in whole cells and in sub-cellular compartments
  2. Mapping of protein-coding regions
  3. Delineation of chromatin and DNA accessibility and structure with nucleases and chemical probes
  4. Mapping of histone modifications and transcription factor binding sites (TFBSs) by chromatin immunoprecipitation (ChIP)
  5. and measurement of DNA methylation

To facilitate comparison and integration of data, ENCODE data production efforts have prioritised selected sets of cell types. The highest priority set (designated "Tier 1) includes two widely studied immortalised cell lines - K562 erythroleukemia cells; an EBV-immortalised B-lymphoblastoid line and the H1 human embryonic stem cell line. A secondary priority set (Tier 2) includes HeLa-S3 cervical carcinoma cells, HepG2 hepatoblastoma cells, and primary (non-transformed) human umbilical vein endothelial cells (HUVEC), which have limited proliferation potential in culture. A third set (Tier 3) currently comprises more than 100 cell types that are being analysed in selected assays.

A major goal of ENCODE is to manually annotate all protein-coding genes, pseudogenes, and non-coding transcribed loci in the human genome and to catalog the products of transcription including splice isoforms. This annotation process involves consolidation of all evidence of transcripts (cDNA, EST sequences) and proteins from public databases, followed by building gene structures based on supporting experimental data.

Ultimately, each gene or transcript model is assigned one of three confidence levels:

  1. Level 1 includes genes validated by RT-PCR and sequencing, plus consensus psuedogenes.
  2. Level 2 includes manually annotated coding and long non-coding loci that have transcriptional evidence in EMBL/GenBank.
  3. Level 3 includes Ensembl gene predictions in regions not yet manually annotated or for which there is new transcriptional evidence.

The result of ENCODE gene annotation (termed "GENCODE") is a comprehensive catalog of transcripts and gene models.

Another goal of ENCODE is to produce a comprehensive genome-wide catalog of transcribed loci that characterises the size, polyadenylation status, and subcellular compartmentalisation of all transcripts. Both polyA+ and polyA- RNAs are being analysed and not only total whole cell RNAs but also those concentrated in the nucleus and cytosol. Long (>200nt) and short RNAs (<200nt) are being sequenced from each subcellular compartment, providing catalogs of potential miRNAs, snoRNA, promoter-associated short RNAs (PASRs) and other short cellular RNAs. Cis-regulatory regions include diverse functional elements (e.g., promoters, enhancers, silencers, and insulators) that collectively modulate the magnitude, timing, and cell-specificity of gene expression. The ENCODE Project is using multiple approaches to identify cis-regulatory regions, including localising their characteristic chromatin signatures and identifying sites of occupancy of sequence-specific transcription factors. Human cis-regulatory regions characteristically exhibit nuclease hypersensitivity and may show increased solubility after chromatin fixation and fragmentation. Additionally, specific patterns of post-translational histone modifications have been connected with distinct classes of regions such as promoters and enhancers. Chromatin accessibility and histone modifications thus provide independent and complementary annotations of human regulatory DNA. DNaseI hypersensitive sites (DHSs) are being mapped by two techniques: (i) capture of free DNA ends at in vivo DNaseI cleavage sites with biotinylated adapters, followed by digestion with a TypeIIS restiction enzyme to generate ~20 bp DNaseI cleavage site tags and (ii) direct sequencing of DNaseI cleavage sites at the ends of small (<300 bp) DNA fragments released by limiting treatment with DNaseI. For more information see the ENCODE user guide.

Downloading ENCODE data

Release ENCODE data can be download at http://genome.ucsc.edu/ENCODE/downloads.html.

For example if you are interested in the H3K4Me1 CHiP-Seq data, bigWig files are provided at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRegMarkH3k4me1/. You may also be interested in the H3K4Me3 data, since this provides some perspective on the H3K4Me1 data.

To convert the bigWig file to a human readable (i.e. non-binary), download the bigWigToBedGraph executable.

Then convert the bigWig file to bedGraph by running:

bigWigToBedGraph wgEncodeBroadHistoneK562H3k4me1StdSig.bigWig wgEncodeBroadHistoneK562H3k4me1StdSig.bedgraph

The output:

chr1 10100 10125 0.6
chr1 10125 10225 1
chr1 10225 10250 1.36
chr1 10250 10275 2
chr1 10275 10300 3.64

Which can then be transformed into a BED file, for use with intersectBed.