The Encyclopedia of DNA Elements (ENCODE) project began in September 2003 as an initiative to identify all functional elements in the human genome sequence[1]. The first phase of the project examined roughly 1% of the genome, about 30 million DNA bases, using a range of experimental assays and computational analyses. It also found that 5% of the genome is under evolutionary constraint; of this 5%:

  • 40% consists of protein-coding genes
  • 20% consists of known, functional, non-coding elements
  • 40% consists of sequence with no known function

Links to articles

Overview articles

Genomics: ENCODE explained -> http://www.nature.com/nature/journal/v489/n7414/full/489052a.html

The making of ENCODE: Lessons for big-data projects -> http://www.nature.com/nature/journal/v489/n7414/full/489049a.html

ENCODE: The human encyclopaedia -> http://www.nature.com/news/encode-the-human-encyclopaedia-1.11312

Research articles

Links to data

ENCODE human data matrix -> https://genome.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html

ENCODE human ChIP-seq matrix -> https://genome.ucsc.edu/ENCODE/dataMatrix/encodeChipMatrixHuman.html

ENCODE cell lines -> http://genome.ucsc.edu/ENCODE/cellTypes.html

Chromatin states

Mapping and analysis of chromatin state dynamics in nine human cell types: http://www.ncbi.nlm.nih.gov/pubmed/21441907

Cell line information: see http://genome.ucsc.edu/ENCODE/cellTypes.html

  • H1ES - H1 human embryonic stem cells
  • K562 - an immortalized cell line produced from a female patient with chronic myelogenous leukemia (CML)
  • GM12878 - a lymphoblastoid cell line produced from the blood of a female donor with northern and western European ancestry by EBV transformation
  • HepG2 - a cell line derived from a male patient with liver carcinoma
  • HUVEC - human umbilical vein endothelial cells with a normal karyotype
  • HSMM - skeletal muscle myoblasts from the mesoderm lineage and muscle tissue with a normal karyotype
  • NHLF - lung fibroblasts from the endoderm lineage and lung tissue with a normal karyotype
  • NHEK - epidermal keratinocytes from the ectoderm lineage and skin with a normal karyotype
  • HMEC - mammary epithelial cells from the ectoderm lineage and breast tissue with a normal karyotype

The fifteen states

  • 1_Active_Promoter
  • 2_Weak_Promoter
  • 3_Poised_Promoter
  • 4_Strong_Enhancer
  • 5_Strong_Enhancer
  • 6_Weak_Enhancer
  • 7_Weak_Enhancer
  • 8_Insulator
  • 9_Txn_Transition
  • 10_Txn_Elongation
  • 11_Weak_Txn
  • 12_Repressed
  • 13_Heterochrom/lo
  • 14_Repetitive/CNV
  • 15_Repetitive/CNV
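
Several of the fifteen states share a label (for example, states 4 and 5 are both Strong_Enhancer). As a minimal sketch, the numeric prefix can be stripped to see which distinct labels remain. The helper names state_label and unique_state_labels are hypothetical and only used for illustration here.

```shell
# The fifteen state names exactly as listed above.
states='1_Active_Promoter
2_Weak_Promoter
3_Poised_Promoter
4_Strong_Enhancer
5_Strong_Enhancer
6_Weak_Enhancer
7_Weak_Enhancer
8_Insulator
9_Txn_Transition
10_Txn_Elongation
11_Weak_Txn
12_Repressed
13_Heterochrom/lo
14_Repetitive/CNV
15_Repetitive/CNV'

# Hypothetical helper: strip the leading "<number>_" from one state name.
state_label() {
    printf '%s\n' "${1#*_}"
}

# Hypothetical helper: print each distinct label once, in first-seen order.
unique_state_labels() {
    printf '%s\n' "$states" | sed 's/^[0-9]*_//' | awk '!seen[$0]++'
}
```

Running unique_state_labels prints 12 labels, because three of them (Strong_Enhancer, Weak_Enhancer and Repetitive/CNV) each cover two numbered states.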


The ENCODE human ChIP-seq matrix lists two EP300 datasets for H1-hESC from the Myers lab. Download the peaks (broadPeak files) and raw signal (bigWig files) for both replicates:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102PkRep1.broadPeak.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bigWig
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102PkRep2.broadPeak.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bigWig
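
The four file names differ only in the replicate number and the file type, so the URLs can also be generated rather than typed out. This is a sketch under that assumption. The function name p300_urls is hypothetical, and the base URL and file prefix are copied from the wget commands above.

```shell
# Base URL and file prefix copied from the wget commands above.
base='http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs'
prefix='wgEncodeHaibTfbsH1hescP300V0416102'

# Hypothetical helper: print the peak and raw-signal URL for each replicate.
p300_urls() {
    for rep in 1 2; do
        printf '%s/%sPkRep%s.broadPeak.gz\n' "$base" "$prefix" "$rep"
        printf '%s/%sRawRep%s.bigWig\n' "$base" "$prefix" "$rep"
    done
}
```

To fetch the files, the list can be piped to wget, which reads URLs from standard input when given "-i -": p300_urls | wget -i -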

Use bigWigToBedGraph to convert from bigWig to bedGraph format:

bigWigToBedGraph - Convert from bigWig to bedGraph format.
   bigWigToBedGraph in.bigWig out.bedGraph
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restrict output to only that under end
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

Run bigWigToBedGraph:

bigWigToBedGraph wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bigWig wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bedGraph
bigWigToBedGraph wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bigWig wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bedGraph
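
The two commands above can also be written as a loop over the replicate files. This is only a sketch: the helper bedgraph_name is hypothetical, and the loop assumes the bigWig files sit in the current directory and that bigWigToBedGraph is on the PATH.

```shell
# Hypothetical helper: derive the bedGraph file name from a bigWig file name.
bedgraph_name() {
    printf '%s\n' "${1%.bigWig}.bedGraph"
}

# Convert every raw-signal bigWig in the current directory.
for bw in wgEncodeHaibTfbsH1hescP300V0416102RawRep*.bigWig; do
    # If the glob matched nothing, $bw is the literal pattern; skip it.
    [ -e "$bw" ] || continue
    out=$(bedgraph_name "$bw")
    # Skip files that were already converted.
    if [ -e "$out" ]; then
        continue
    fi
    bigWigToBedGraph "$bw" "$out"
done
```

Adding -chrom=chr1 to the bigWigToBedGraph call, as listed in the usage above, would restrict the output to chromosome 1 and keep the bedGraph files smaller.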

Check out the bedGraph files:

head -3 *.bedGraph
==> wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bedGraph <==
chr1    10245   10246   0.385339
chr1    10246   10248   0.462407
chr1    10248   10261   0.616542

==> wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bedGraph <==
chr1    10156   10157   0.20465
chr1    10157   10159   0.255812
chr1    10159   10176   0.409299
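
Each bedGraph line holds a chromosome, a start, an end and a signal value, with zero-based, half-open coordinates, so a line covers (end - start) bases. As a rough sketch, the covered bases and the length-weighted mean signal can be computed with awk. The function name bedgraph_summary is hypothetical, and the sample input is the first three Rep1 lines shown above.

```shell
# Hypothetical helper: summarise a bedGraph stream read from standard input.
# Prints the number of bases covered and the length-weighted mean signal.
bedgraph_summary() {
    awk '{ len = $3 - $2; bases += len; total += len * $4 }
         END { if (bases > 0) printf "%d\t%.4f\n", bases, total / bases }'
}

# Sample input: the first three Rep1 lines from the head output above.
printf 'chr1\t10245\t10246\t0.385339\nchr1\t10246\t10248\t0.462407\nchr1\t10248\t10261\t0.616542\n' |
    bedgraph_summary
# prints: 16	0.5828
```

On a full file the same function can be fed directly: bedgraph_summary < wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bedGraph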

Check out the broadPeak files:

zcat wgEncodeHaibTfbsH1hescP300V0416102PkRep1.broadPeak.gz | head -3
chr1    1072006 1072341 peak1   63      .       161.020 -1      -1
chr1    1590312 1590598 peak2   49      .       126.760 -1      -1
chr1    1624111 1624681 peak3   318     .       811.620 -1      -1

zcat wgEncodeHaibTfbsH1hescP300V0416102PkRep2.broadPeak.gz | head -3
chr1    839954  840316  peak1   53      .       164.470 -1      -1
chr1    856387  856832  peak2   36      .       112.670 -1      -1
chr1    937184  937561  peak3   48      .       149.600 -1      -1

ENCODE broadPeak file format -> http://genome.ucsc.edu/FAQ/FAQformat.html#format13

This format is used to provide called regions of signal enrichment based on pooled, normalized (interpreted) data. It is a BED 6+3 format.

  1. chrom - Name of the chromosome (or contig, scaffold, etc.).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. name - Name given to a region (preferably unique). Use '.' if no name is assigned.
  5. score - Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were '0' when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.
  6. strand - +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
  7. signalValue - Measurement of overall (usually, average) enrichment for the region.
  8. pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
  9. qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
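
With the column layout above, a broadPeak file can be inspected with awk. The following is a minimal sketch. The function names peak_widths and strongest_peak are hypothetical, and the width calculation relies on chromEnd being exclusive, as described for column 3.

```shell
# Hypothetical helper: print name, width and signalValue for each peak.
# chromEnd is not included in the feature, so width = chromEnd - chromStart.
peak_widths() {
    awk -v OFS='\t' '{ print $4, $3 - $2, $7 }'
}

# Hypothetical helper: print the name of the peak with the highest
# signalValue (column 7). Adding 0 forces a numeric comparison.
strongest_peak() {
    awk 'NR == 1 || $7 + 0 > max { max = $7 + 0; name = $4 }
         END { if (NR) print name }'
}
```

For example, zcat wgEncodeHaibTfbsH1hescP300V0416102PkRep1.broadPeak.gz | head -3 | peak_widths gives widths of 335, 286 and 570 bases for the three Rep1 peaks shown earlier, and strongest_peak on the same three lines returns peak3.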

