Encyclopedia of DNA elements (ENCODE)

ENCODE is an abbreviation of “Encyclopedia of DNA Elements”, a project that aims to discover and define the functional elements encoded in the human genome using multiple technologies. In this context, the term “functional element” denotes a discrete region of the genome that encodes a defined product (e.g., protein) or a reproducible biochemical signature, such as transcription or a specific chromatin structure. It is now widely appreciated that such signatures, either alone or in combination, mark genomic sequences with important functions, including exons, sites of RNA processing, and transcriptional regulatory elements such as promoters, enhancers, silencers, and insulators.

The data from the ENCODE project include:

  1. Identification and quantification of RNA species in whole cells and in sub-cellular compartments
  2. Mapping of protein-coding regions
  3. Delineation of chromatin and DNA accessibility and structure with nucleases and chemical probes
  4. Mapping of histone modifications and transcription factor binding sites (TFBSs) by chromatin immunoprecipitation (ChIP)
  5. Measurement of DNA methylation

To facilitate comparison and integration of data, ENCODE data production efforts have prioritised selected sets of cell types. The highest priority set (designated “Tier 1”) includes two widely studied immortalised cell lines, K562 erythroleukemia cells and an EBV-immortalised B-lymphoblastoid line, as well as the H1 human embryonic stem cell line. A secondary priority set (Tier 2) includes HeLa-S3 cervical carcinoma cells, HepG2 hepatoblastoma cells, and primary (non-transformed) human umbilical vein endothelial cells (HUVEC), which have limited proliferation potential in culture. A third set (Tier 3) currently comprises more than 100 cell types that are being analysed in selected assays.

A major goal of ENCODE is to manually annotate all protein-coding genes, pseudogenes, and non-coding transcribed loci in the human genome and to catalog the products of transcription including splice isoforms. This annotation process involves consolidation of all evidence of transcripts (cDNA, EST sequences) and proteins from public databases, followed by building gene structures based on supporting experimental data.

Ultimately, each gene or transcript model is assigned one of three confidence levels:

  1. Level 1 includes genes validated by RT-PCR and sequencing, plus consensus pseudogenes.
  2. Level 2 includes manually annotated coding and long non-coding loci that have transcriptional evidence in EMBL/GenBank.
  3. Level 3 includes Ensembl gene predictions in regions not yet manually annotated or for which there is new transcriptional evidence.

The result of ENCODE gene annotation (termed “GENCODE”) is a comprehensive catalog of transcripts and gene models.
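
As a rough illustration of how the three confidence levels can be used in practice, the sketch below counts gene records per level in GTF-formatted text. It assumes the annotation is in a GENCODE-style GTF where each record carries an unquoted `level N;` attribute; the sample records (gene IDs, coordinates) and the function name `count_levels` are made up for illustration and are not real annotations.

```python
import re
from collections import Counter

# Hypothetical GTF records in the GENCODE style; IDs and coordinates are invented.
SAMPLE_GTF = """\
chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A"; gene_type "protein_coding"; level 1;
chr1\tHAVANA\tgene\t2000\t2900\t.\t-\t.\tgene_id "GENE_B"; gene_type "lincRNA"; level 2;
chr1\tENSEMBL\tgene\t5000\t5900\t.\t+\t.\tgene_id "GENE_C"; gene_type "protein_coding"; level 3;
chr1\tHAVANA\tgene\t7000\t7900\t.\t+\t.\tgene_id "GENE_D"; gene_type "pseudogene"; level 1;
"""

# Matches the unquoted "level N;" attribute (assumed attribute layout).
LEVEL_RE = re.compile(r"\blevel (\d+);")


def count_levels(gtf_text):
    """Count 'gene' records per confidence level in GTF-formatted text."""
    counts = Counter()
    for line in gtf_text.splitlines():
        # Skip blank lines and header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = LEVEL_RE.search(fields[8])
        if match:
            counts[int(match.group(1))] += 1
    return counts


level_counts = count_levels(SAMPLE_GTF)
for level in sorted(level_counts):
    print(f"Level {level}: {level_counts[level]} gene(s)")
```

To apply this to a real annotation file, the text of that file would be passed in place of `SAMPLE_GTF`; restricting to Level 1 and 2 records is one way to keep only manually curated models.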

Another goal of ENCODE is to produce a comprehensive genome-wide catalog of transcribed loci that characterises the size, polyadenylation status, and subcellular compartmentalisation of all transcripts. Both polyA+ and polyA- RNAs are being analysed, and not only total whole-cell RNAs but also those concentrated in the nucleus and cytosol. Long (>200 nt) and short (<200 nt) RNAs are being sequenced from each subcellular compartment, providing catalogs of potential miRNAs, snoRNAs, promoter-associated short RNAs (PASRs) and other short cellular RNAs.

Cis-regulatory regions include diverse functional elements (e.g., promoters, enhancers, silencers, and insulators) that collectively modulate the magnitude, timing, and cell-specificity of gene expression. The ENCODE Project is using multiple approaches to identify cis-regulatory regions, including localising their characteristic chromatin signatures and identifying sites of occupancy of sequence-specific transcription factors. Human cis-regulatory regions characteristically exhibit nuclease hypersensitivity and may show increased solubility after chromatin fixation and fragmentation. Additionally, specific patterns of post-translational histone modifications have been connected with distinct classes of regions such as promoters and enhancers. Chromatin accessibility and histone modifications thus provide independent and complementary annotations of human regulatory DNA.

DNaseI hypersensitive sites (DHSs) are being mapped by two techniques: (i) capture of free DNA ends at in vivo DNaseI cleavage sites with biotinylated adapters, followed by digestion with a TypeIIS restriction enzyme to generate ~20 bp DNaseI cleavage site tags, and (ii) direct sequencing of DNaseI cleavage sites at the ends of small (<300 bp) DNA fragments released by limiting treatment with DNaseI.

For more information see the ENCODE user guide.

Downloading ENCODE data

Released ENCODE data can be downloaded from http://genome.ucsc.edu/ENCODE/downloads.html.

For example, if you are interested in the H3K4me1 ChIP-seq data, bigWig files are provided at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRegMarkH3k4me1/. You may also be interested in the H3K4me3 data, since it provides some perspective on the H3K4me1 data.

To convert a bigWig file to a human-readable (i.e. non-binary) format, download the bigWigToBedGraph executable.

Then convert the bigWig file to bedGraph by running:

bigWigToBedGraph wgEncodeBroadHistoneK562H3k4me1StdSig.bigWig wgEncodeBroadHistoneK562H3k4me1StdSig.bedgraph

The output:

chr1 10100 10125 0.6
chr1 10125 10225 1
chr1 10225 10250 1.36
chr1 10250 10275 2
chr1 10275 10300 3.64

The bedGraph file can then be transformed into a BED file for use with intersectBed.
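
As a rough sketch of that transformation, the Python below turns bedGraph records into six-column BED lines, optionally keeping only intervals at or above a signal threshold. The function name `bedgraph_to_bed`, the `region_N` naming scheme, and the threshold of 2.0 are illustrative assumptions, not part of any ENCODE or UCSC tool. Note also that the strict BED specification expects an integer score between 0 and 1000; this sketch simply stores the raw signal in the score column, which may need adjusting if a downstream tool enforces the specification.

```python
import io

# The first lines of the bedGraph output shown above.
SAMPLE_BEDGRAPH = """\
chr1 10100 10125 0.6
chr1 10125 10225 1
chr1 10225 10250 1.36
chr1 10250 10275 2
chr1 10275 10300 3.64
"""


def bedgraph_to_bed(lines, min_signal=0.0):
    """Yield tab-separated BED6 lines for intervals with signal >= min_signal."""
    kept = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional track/browser/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        # bedGraph columns: chrom, start, end, signal (whitespace-separated).
        chrom, start, end, value = line.split()[:4]
        if float(value) < min_signal:
            continue
        kept += 1
        # BED6 columns: chrom, start, end, name, score, strand.
        # The name is a made-up label; the strand is unknown, hence ".".
        yield "\t".join([chrom, start, end, f"region_{kept}", value, "."])


# Any iterable of lines works here, including an open file handle.
bed_lines = list(bedgraph_to_bed(io.StringIO(SAMPLE_BEDGRAPH), min_signal=2.0))
print("\n".join(bed_lines))
```

To process the full file, an open handle on the .bedgraph file can be passed instead of the `io.StringIO` wrapper, and the yielded lines written to a .bed file.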


Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
2 comments
  1. Hi Davo,

    Great help from your website for both information and scripts.

    I am also interested in the ENCODE project, but I am still confused about the data.

    As far as I know, bigWig files are raw signals; however, ENCODE also offers peak files processed from the raw signal files.

    What are the different uses of these two file types?

    Another question: among the peak files there are both peak and uniform peak files. What is the difference between them?

    If we want to overlap TF binding with DNase, using ChIP-seq and DNase data, which files should I use: the raw signal files or just the peak files?


    1. Hi Junfeng,

      Glad you found the site useful and thanks for the comment.

      You can find more information on the peak files at this link. The bigWig file is mainly used for visualisation purposes on the UCSC Genome Browser.

      As for overlapping ChIP-seq data with DNase, I would use the peak files. But have a close look at the link above to understand how the peaks are called and whether that serves your purpose.

      Hope that helps,

