ENCODE

From Dave's wiki

The Encyclopedia Of DNA Elements (ENCODE) project began in September 2003 as an initiative to identify all functional elements in the human genome sequence[1]. The first phase of the project examined roughly 1% of the genome (about 30 million DNA bases) using a range of experimental tests and computational analyses. It also found that 5% of the genome is under evolutionary constraint; of this 5%:

  • 40% consists of protein-coding genes
  • 20% consists of known, functional, non-coding elements
  • 40% consists of sequence with no known function
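As a rough worked example, the percentages above can be turned into approximate base counts. This sketch assumes a genome size of 3 billion bases, which is only inferred from the statement that 1% is about 30 million bases; the variable and function names are illustrative.

```python
# Assumption: ~3 billion bases, inferred from "1% is about 30 million bases".
GENOME_SIZE = 3_000_000_000

# 5% of the genome is reported to be under evolutionary constraint.
constrained = GENOME_SIZE * 5 // 100

# Breakdown of the constrained 5%, given as percentages of that fraction.
breakdown = {
    "protein-coding genes": 40,
    "known, functional, non-coding elements": 20,
    "sequence with no known function": 40,
}


def constrained_bases(percent):
    """Return the number of constrained bases covered by a percentage."""
    return constrained * percent // 100


if __name__ == "__main__":
    print(f"under constraint: {constrained:,} bases")
    for label, percent in breakdown.items():
        print(f"{label}: {constrained_bases(percent):,} bases")
```

Under this assumption, the three categories correspond to roughly 2%, 1% and 2% of the whole genome.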

Links to articles

The articles below are also included in the free ENCODE iPad app.

Overview articles

Genomics: ENCODE explained -> http://www.nature.com/nature/journal/v489/n7414/full/489052a.html

The making of ENCODE: Lessons for big-data projects -> http://www.nature.com/nature/journal/v489/n7414/full/489049a.html

ENCODE: The human encyclopaedia -> http://www.nature.com/news/encode-the-human-encyclopaedia-1.11312

Research articles

Links to data

ENCODE human data matrix -> https://genome.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html

ENCODE human ChIP-seq matrix -> https://genome.ucsc.edu/ENCODE/dataMatrix/encodeChipMatrixHuman.html

ENCODE cell lines -> http://genome.ucsc.edu/ENCODE/cellTypes.html

Chromatin states

Mapping and analysis of chromatin state dynamics in nine human cell types: http://www.ncbi.nlm.nih.gov/pubmed/21441907

Cell line information: see http://genome.ucsc.edu/ENCODE/cellTypes.html

  • H1ES - H1 human embryonic stem cells
  • K562 - an immortalized cell line produced from a female patient with chronic myelogenous leukemia (CML)
  • GM12878 - a lymphoblastoid cell line produced from the blood of a female donor with northern and western European ancestry by EBV transformation
  • HepG2 - a cell line derived from a male patient with liver carcinoma
  • HUVEC - human umbilical vein endothelial cells with a normal karyotype
  • HSMM - skeletal muscle myoblasts from the mesoderm lineage and muscle tissue with a normal karyotype
  • NHLF - lung fibroblasts from the endoderm lineage and lung tissue with a normal karyotype
  • NHEK - epidermal keratinocytes from the ectoderm lineage and skin with a normal karyotype
  • HMEC - mammary epithelial cells from the ectoderm lineage and breast tissue with a normal karyotype

The fifteen states

  • 1_Active_Promoter
  • 2_Weak_Promoter
  • 3_Poised_Promoter
  • 4_Strong_Enhancer
  • 5_Strong_Enhancer
  • 6_Weak_Enhancer
  • 7_Weak_Enhancer
  • 8_Insulator
  • 9_Txn_Transition
  • 10_Txn_Elongation
  • 11_Weak_Txn
  • 12_Repressed
  • 13_Heterochrom/lo
  • 14_Repetitive/CNV
  • 15_Repetitive/CNV
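Several of the fifteen labels share a name (for example, states 4 and 5 are both Strong_Enhancer). If the labels need to be handled programmatically, a minimal sketch is to split each label into its number and name and group the numbers by name. The function names here are hypothetical helpers, not part of any ENCODE tool.

```python
# The fifteen state labels exactly as listed above.
STATE_LABELS = [
    "1_Active_Promoter", "2_Weak_Promoter", "3_Poised_Promoter",
    "4_Strong_Enhancer", "5_Strong_Enhancer", "6_Weak_Enhancer",
    "7_Weak_Enhancer", "8_Insulator", "9_Txn_Transition",
    "10_Txn_Elongation", "11_Weak_Txn", "12_Repressed",
    "13_Heterochrom/lo", "14_Repetitive/CNV", "15_Repetitive/CNV",
]


def parse_state(label):
    """Split '13_Heterochrom/lo' into (13, 'Heterochrom/lo')."""
    # Split on the first underscore only; names may contain underscores.
    number, name = label.split("_", 1)
    return int(number), name


def group_by_name(labels):
    """Map each state name to the list of state numbers that carry it."""
    groups = {}
    for label in labels:
        number, name = parse_state(label)
        groups.setdefault(name, []).append(number)
    return groups


if __name__ == "__main__":
    for name, numbers in group_by_name(STATE_LABELS).items():
        print(name, numbers)
```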

P300

The ENCODE human ChIP-seq matrix lists two EP300 datasets for H1-hESC from the Myers lab. Download the peaks (broadPeak files) and the raw signal (bigWig files) for both:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102PkRep1.broadPeak.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bigWig
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102PkRep2.broadPeak.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bigWig
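The four file names above differ only in the replicate number and the file type, so they can be generated rather than typed out. The following is a sketch of that idea; `file_names` and `download_all` are hypothetical helper names, and the base URL and prefix are copied from the wget commands above.

```python
import os
import urllib.request

# Copied from the wget commands above.
BASE_URL = ("http://hgdownload.cse.ucsc.edu/goldenPath/hg19/"
            "encodeDCC/wgEncodeHaibTfbs/")
PREFIX = "wgEncodeHaibTfbsH1hescP300V0416102"


def file_names(replicates=(1, 2)):
    """Return the peak and raw-signal file names for each replicate."""
    names = []
    for rep in replicates:
        names.append(f"{PREFIX}PkRep{rep}.broadPeak.gz")
        names.append(f"{PREFIX}RawRep{rep}.bigWig")
    return names


def download_all(dest_dir="."):
    """Download every file into dest_dir (requires network access)."""
    for name in file_names():
        urllib.request.urlretrieve(BASE_URL + name,
                                   os.path.join(dest_dir, name))


if __name__ == "__main__":
    # Only list the URLs here; call download_all() to fetch the files.
    for name in file_names():
        print(BASE_URL + name)
```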

Use bigWigToBedGraph to convert from bigWig to bedGraph format:

bigWigToBedGraph - Convert from bigWig to bedGraph format.
usage:
   bigWigToBedGraph in.bigWig out.bedGraph
options:
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restict output to only that under end
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

Run bigWigToBedGraph:

bigWigToBedGraph wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bigWig wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bedGraph
bigWigToBedGraph wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bigWig wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bedGraph
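The same conversion can be scripted, which is convenient when restricting the output with the -chrom, -start and -end options from the usage message. This is a sketch only: the helper names are hypothetical, it assumes bigWigToBedGraph is on the PATH, and it assumes the options are accepted ahead of the two file arguments.

```python
import subprocess


def bigwig_to_bedgraph_cmd(bigwig, out=None, chrom=None, start=None, end=None):
    """Build the bigWigToBedGraph argument list for one input file."""
    if out is None:
        # Default output name: swap the .bigWig suffix for .bedGraph.
        stem = bigwig[:-len(".bigWig")] if bigwig.endswith(".bigWig") else bigwig
        out = stem + ".bedGraph"
    cmd = ["bigWigToBedGraph"]
    # Only the options documented in the usage message are used.
    if chrom is not None:
        cmd.append(f"-chrom={chrom}")
    if start is not None:
        cmd.append(f"-start={start}")
    if end is not None:
        cmd.append(f"-end={end}")
    cmd += [bigwig, out]
    return cmd


def convert(bigwig, **kwargs):
    """Run the conversion; raises CalledProcessError if the tool fails."""
    subprocess.run(bigwig_to_bedgraph_cmd(bigwig, **kwargs), check=True)


if __name__ == "__main__":
    # Print the command that would be run for replicate 1, chr1 only.
    print(" ".join(bigwig_to_bedgraph_cmd(
        "wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bigWig",
        out="rep1.chr1.bedGraph", chrom="chr1")))
```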

Check out the bedGraph files:

head -3 *.bedGraph
==> wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bedGraph <==
chr1    10245   10246   0.385339
chr1    10246   10248   0.462407
chr1    10248   10261   0.616542

==> wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bedGraph <==
chr1    10156   10157   0.20465
chr1    10157   10159   0.255812
chr1    10159   10176   0.409299
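Each bedGraph line gives a chromosome, a start, an end and a signal value that applies to every base in that interval. As an illustration of how to read the format, the sketch below computes the length-weighted mean signal over the first three lines of replicate 1 shown above. The function names are hypothetical.

```python
# First three lines of the replicate 1 bedGraph, as shown above.
SAMPLE = [
    "chr1    10245   10246   0.385339",
    "chr1    10246   10248   0.462407",
    "chr1    10248   10261   0.616542",
]


def parse_bedgraph_line(line):
    """Return (chrom, start, end, value) from one bedGraph data line."""
    chrom, start, end, value = line.split()
    return chrom, int(start), int(end), float(value)


def weighted_mean_signal(lines):
    """Mean signal per base, weighting each interval by its length."""
    total_bases = 0
    total_signal = 0.0
    for line in lines:
        # Skip blank lines and track/browser/comment header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        _, start, end, value = parse_bedgraph_line(line)
        width = end - start  # the end coordinate is exclusive
        total_bases += width
        total_signal += width * value
    return total_signal / total_bases if total_bases else 0.0


if __name__ == "__main__":
    print(weighted_mean_signal(SAMPLE))
```

The three sample intervals cover 1, 2 and 13 bases, so the last interval dominates the mean.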

Check out the broadPeak files:

zcat wgEncodeHaibTfbsH1hescP300V0416102PkRep1.broadPeak.gz | head -3
chr1    1072006 1072341 peak1   63      .       161.020 -1      -1
chr1    1590312 1590598 peak2   49      .       126.760 -1      -1
chr1    1624111 1624681 peak3   318     .       811.620 -1      -1

zcat wgEncodeHaibTfbsH1hescP300V0416102PkRep2.broadPeak.gz | head -3
chr1    839954  840316  peak1   53      .       164.470 -1      -1
chr1    856387  856832  peak2   36      .       112.670 -1      -1
chr1    937184  937561  peak3   48      .       149.600 -1      -1

ENCODE broadPeak file format -> http://genome.ucsc.edu/FAQ/FAQformat.html#format13

This format is used to provide called regions of signal enrichment based on pooled, normalized (interpreted) data. It is a BED 6+3 format.

  1. chrom - Name of the chromosome (or contig, scaffold, etc.).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. name - Name given to a region (preferably unique). Use '.' if no name is assigned.
  5. score - Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were '0' when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.
  6. strand - +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
  7. signalValue - Measurement of overall (usually, average) enrichment for the region.
  8. pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
  9. qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
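The nine columns above map naturally onto a named record. The following is a minimal sketch of a broadPeak parser using the field names from the list; `BroadPeak`, `parse_broadpeak_line` and `read_broadpeak` are hypothetical names, and the example line is the first peak of replicate 1 shown earlier.

```python
import gzip
from collections import namedtuple

# Field names follow the nine broadPeak columns described above.
BroadPeak = namedtuple(
    "BroadPeak",
    "chrom chromStart chromEnd name score strand signalValue pValue qValue",
)


def parse_broadpeak_line(line):
    """Convert one whitespace-separated broadPeak line into a BroadPeak."""
    f = line.split()
    return BroadPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                     float(f[6]), float(f[7]), float(f[8]))


def read_broadpeak(path):
    """Read every peak from a gzip-compressed broadPeak file."""
    with gzip.open(path, "rt") as handle:
        return [parse_broadpeak_line(line) for line in handle if line.strip()]


if __name__ == "__main__":
    peak = parse_broadpeak_line(
        "chr1    1072006 1072341 peak1   63      .       161.020 -1      -1")
    # Coordinates are 0-based and half-open, so width is end minus start.
    print(peak.name, peak.chromEnd - peak.chromStart, peak.signalValue)
```

In the files above the pValue and qValue columns are -1, meaning that no value was assigned.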

Blog posts about ENCODE

http://adaptivecomplexity.blogspot.jp/2007/06/our-genomes-encode-and-intelligent.html

http://genomeinformatician.blogspot.jp/2012/09/encode-my-own-thoughts.html