ENCODE
The Encyclopedia of DNA Elements (ENCODE) project began in September 2003 as an initiative to identify all functional elements in the human genome sequence[1]. The first phase of the project examined roughly 1% of the genome (about 30 million DNA bases) using a range of experimental assays and computational analyses. It also found that 5% of the genome is under evolutionary constraint; of this 5%:
- 40% consists of protein-coding genes
- 20% consists of known, functional, non-coding elements
- 40% consists of sequence with no known function
[1] ENCODE overview -> http://www.genome.gov/10005107
Links to articles
The articles below are also included in the free ENCODE iPad app.
Overview articles
Genomics: ENCODE explained -> http://www.nature.com/nature/journal/v489/n7414/full/489052a.html
The making of ENCODE: Lessons for big-data projects -> http://www.nature.com/nature/journal/v489/n7414/full/489049a.html
ENCODE: The human encyclopaedia -> http://www.nature.com/news/encode-the-human-encyclopaedia-1.11312
Research articles
Links to data
ENCODE human data matrix -> https://genome.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html
ENCODE human ChIP-seq matrix -> https://genome.ucsc.edu/ENCODE/dataMatrix/encodeChipMatrixHuman.html
ENCODE cell lines -> http://genome.ucsc.edu/ENCODE/cellTypes.html
Chromatin states
Mapping and analysis of chromatin state dynamics in nine human cell types: http://www.ncbi.nlm.nih.gov/pubmed/21441907
Cell line information: see http://genome.ucsc.edu/ENCODE/cellTypes.html
- H1ES - H1 human embryonic stem cells
- K562 - an immortalized cell line produced from a female patient with chronic myelogenous leukemia (CML)
- GM12878 - a lymphoblastoid cell line produced from the blood of a female donor with northern and western European ancestry by EBV transformation
- HepG2 - a cell line derived from a male patient with liver carcinoma
- HUVEC - human umbilical vein endothelial cells with a normal karyotype
- HSMM - skeletal muscle myoblasts (mesoderm lineage, muscle tissue) with a normal karyotype
- NHLF - lung fibroblasts (endoderm lineage, lung tissue) with a normal karyotype
- NHEK - epidermal keratinocytes (ectoderm lineage, skin) with a normal karyotype
- HMEC - mammary epithelial cells (ectoderm lineage, breast tissue) with a normal karyotype
The fifteen states
- 1_Active_Promoter
- 2_Weak_Promoter
- 3_Poised_Promoter
- 4_Strong_Enhancer
- 5_Strong_Enhancer
- 6_Weak_Enhancer
- 7_Weak_Enhancer
- 8_Insulator
- 9_Txn_Transition
- 10_Txn_Elongation
- 11_Weak_Txn
- 12_Repressed
- 13_Heterochrom/lo
- 14_Repetitive/CNV
- 15_Repetitive/CNV
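A quick way to get a feel for the fifteen states is to tally how many bases each one covers in a given cell type. The sketch below assumes a BED-like segmentation stream whose fourth column holds a state label such as 1_Active_Promoter; that column layout, the function name state_coverage and the file name in the usage comment are assumptions for illustration, not something stated above.

```shell
# Sum the bases assigned to each chromatin state in a BED-like segmentation.
# Assumed columns (whitespace-separated): chrom, start, end, state label.
state_coverage() {
    awk '
        { bases[$4] += $3 - $2 }      # BED intervals are half-open, so length = end - start
        END { for (s in bases) printf "%s\t%.0f\n", s, bases[s] }
    ' | sort -n                       # order rows by the leading state number
}

# Hypothetical usage (segmentation.bed.gz is a placeholder file name):
#   zcat segmentation.bed.gz | state_coverage
```

The output is one line per state, tab-separated, with the state label followed by the total number of bases.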
P300
From the ENCODE human ChIP-seq matrix, there are 2 datasets for EP300 for H1-hESC from the Myers lab. Download the peaks (broadPeak files) and raw signal (bigWig files):
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102PkRep1.broadPeak.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bigWig
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102PkRep2.broadPeak.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bigWig
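The four file names follow a regular pattern (PkRepN.broadPeak.gz and RawRepN.bigWig for each replicate), so the URL list can also be generated rather than typed out. This is only a sketch; the function name p300_urls is made up for illustration, and the base URL and file prefix are copied from the commands above.

```shell
# Print the four download URLs for the two EP300 replicates.
p300_urls() {
    base=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs
    prefix=wgEncodeHaibTfbsH1hescP300V0416102
    for rep in 1 2; do
        echo "$base/${prefix}PkRep${rep}.broadPeak.gz"   # called peaks
        echo "$base/${prefix}RawRep${rep}.bigWig"        # raw signal
    done
}

# To download, pipe the list to wget, which reads URLs from stdin with "-i -":
#   p300_urls | wget -i -
```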
Use bigWigToBedGraph to convert from bigWig to bedGraph format:
bigWigToBedGraph - Convert from bigWig to bedGraph format.
usage:
   bigWigToBedGraph in.bigWig out.bedGraph
options:
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restict output to only that under end
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
Run bigWigToBedGraph:
bigWigToBedGraph wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bigWig wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bedGraph
bigWigToBedGraph wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bigWig wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bedGraph
Check out the bedGraph files:
head -3 *.bedGraph
==> wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bedGraph <==
chr1 10245 10246 0.385339
chr1 10246 10248 0.462407
chr1 10248 10261 0.616542
==> wgEncodeHaibTfbsH1hescP300V0416102RawRep2.bedGraph <==
chr1 10156 10157 0.20465
chr1 10157 10159 0.255812
chr1 10159 10176 0.409299
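Each bedGraph line gives a chromosome, a half-open interval and one signal value for every base in that interval, so a per-file summary has to weight each value by the interval length. The following is a minimal sketch of that calculation; the function name bedgraph_summary is an assumption for illustration.

```shell
# Summarise a bedGraph stream: bases covered and length-weighted mean signal.
# Columns: chrom, start, end, value; intervals are half-open [start, end).
bedgraph_summary() {
    awk '
        { len = $3 - $2; bases += len; total += len * $4 }
        END { if (bases > 0) printf "%.0f\t%.4f\n", bases, total / bases }
    '
}

# Usage with one of the converted files:
#   bedgraph_summary < wgEncodeHaibTfbsH1hescP300V0416102RawRep1.bedGraph
```

Applied to just the three replicate 1 lines shown above, the intervals cover 1 + 2 + 13 = 16 bases with a weighted mean signal of about 0.5828.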
Check out the broadPeak files:
zcat wgEncodeHaibTfbsH1hescP300V0416102PkRep1.broadPeak.gz | head -3
chr1 1072006 1072341 peak1 63 . 161.020 -1 -1
chr1 1590312 1590598 peak2 49 . 126.760 -1 -1
chr1 1624111 1624681 peak3 318 . 811.620 -1 -1
zcat wgEncodeHaibTfbsH1hescP300V0416102PkRep2.broadPeak.gz | head -3
chr1 839954 840316 peak1 53 . 164.470 -1 -1
chr1 856387 856832 peak2 36 . 112.670 -1 -1
chr1 937184 937561 peak3 48 . 149.600 -1 -1
ENCODE broadPeak file format -> http://genome.ucsc.edu/FAQ/FAQformat.html#format13
This format is used to provide called regions of signal enrichment based on pooled, normalized (interpreted) data. It is a BED 6+3 format.
- chrom - Name of the chromosome (or contig, scaffold, etc.).
- chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
- chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
- name - Name given to a region (preferably unique). Use '.' if no name is assigned.
- score - Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were '0' when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.
- strand - +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
- signalValue - Measurement of overall (usually, average) enrichment for the region.
- pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
- qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
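With the field layout above, the peaks can be filtered and measured using awk alone. The sketch below keeps records whose signalValue (column 7) reaches a cutoff and reports each peak's width as chromEnd - chromStart, which is correct because chromEnd is not included in the feature. The function name strong_peaks and the cutoff of 150 are arbitrary choices for illustration, not recommended values.

```shell
# Keep broadPeak records whose signalValue (column 7) is at least $1 and
# print chrom, chromStart, chromEnd, width, signalValue (tab-separated).
strong_peaks() {
    awk -v min="$1" 'BEGIN { OFS = "\t" }
        ($7 + 0) >= (min + 0) { print $1, $2, $3, $3 - $2, $7 }'
}

# Usage with the replicate 1 file and an arbitrary cutoff of 150:
#   zcat wgEncodeHaibTfbsH1hescP300V0416102PkRep1.broadPeak.gz | strong_peaks 150
```

Since pValue and qValue are -1 in these files, signalValue and score are the only columns available here for ranking peaks.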
Blog posts about ENCODE
http://adaptivecomplexity.blogspot.jp/2007/06/our-genomes-encode-and-intelligent.html
http://genomeinformatician.blogspot.jp/2012/09/encode-my-own-thoughts.html