For the past week I've been attending the:

Karolinska Institutet - RIKEN Joint International Doctoral Course on "Epigenomics: Methods and Applications to Disease and Development"

Today is the last day of the course and the course participants have to come up with a proposal that combines epigenetics and its application to a particular disease (our group chose cancer). I've written notes about epigenetics previously here and also about H3K27ac. Here, I'd just like to share some things I've learned in the past week.

Each time I'm introduced to epigenetics I'm shown this image:

[Image: The epigenetic landscape]

and I'm told that epigenetics refers to the heritable modifications that are not directly related to the DNA sequence. Perhaps I'm thicker than most people but the first time I read "not directly related to the DNA sequence", I got confused. Here's how I would have preferred it to be explained:

DNA is a physical entity inside a cell and as such is subject to the biochemical and mechanical forces that act upon it. Depending on the type of force, the DNA is altered; DNA methylation, a biochemical reaction, is one example. This alteration has downstream consequences for whether the DNA can be accessed and transcribed into RNA. Note that the DNA sequence itself is not changed, only the physical properties of the DNA. These alterations are collectively referred to as epigenetics, and they are heritable and regulated.

I'm not sure how accurate my definition is, but if epigenetics had been explained to me like that, I would have appreciated the term more. In relation to the epigenetic landscape image, I would relate back to my description and explain how these different types of forces can push the ball down different paths that ultimately determine the fate of the ball/cell.

So what are these "forces" that I referred to? In our course we were first introduced to DNA methylation and histone modification, which are two types of forces that act upon DNA. DNA methylation is the covalent addition of a methyl group to the C5 position of cytosine within CpG dinucleotides, which are often clustered as CpG islands in the promoter regions of genes. The speaker also introduced a paper that described CpG shores, which lie a short distance away from CpG islands, have a lower density of CpG dinucleotides, and sometimes show much larger changes in methylation (in either direction) than CpG islands. DNA methylation is an important regulator of gene transcription and, usually, high levels of methylation in the promoter region of a gene result in gene silencing. However, yesterday's presentation showed an example where DNA methylation prevented CTCF from binding to its target, which allowed a downstream enhancer to activate its target. So there are always exceptions to the rules.
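To make the idea of a CpG island a little more concrete, here is a minimal Python sketch that computes the GC content and the observed/expected CpG ratio of a sequence. The thresholds used (length of at least 200 bp, GC content above 50%, observed/expected ratio above 0.6) are commonly quoted criteria and are my assumption here, not something taken from the course; the function names are made up for illustration.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so a plain count is fine
    gc_fraction = (c + g) / n
    # Expected number of CpGs if C and G were placed independently of each other
    expected = (c * g) / n
    obs_exp = cpg / expected if expected > 0 else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC content and obs/exp thresholds to one sequence."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


if __name__ == "__main__":
    toy = "CG" * 100  # an artificial, extremely CpG-rich 200 bp sequence
    print(cpg_stats(toy), looks_like_cpg_island(toy))
```

A real CpG island finder would slide a window along the genome and merge neighbouring hits; this only scores a single sequence.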

When I was first reading about histones, it wasn't clear to me what role histones play in gene regulation. So I randomly chose a histone modification and wrote about it anyway. I am quite surprised to see that when I Google H3K27ac, my post comes up on the front page of results, because surely there are much better references out there on the internet. Anyway, histones are proteins with tails that are normally positively charged due to the amine groups on their lysine and arginine residues. DNA is negatively charged, so it normally sticks to histones, i.e. a mechanical force that packages DNA. The charge of the tails can be altered via histone modifications, and DNA can be accessible or inaccessible depending on the charge of the histone tail. It turns out that different types of chemical modifications on the histone tails govern the role of that particular DNA sequence. You may have seen papers discussing different histone modifications, such as H3K4me3 or H3K9ac. See my tweet for a nice table that summarises different histone marks and their associated functions.

During the week we were introduced to different types of technologies that could check the methylation status of DNA, the types of histone modifications and chromatin conformation. I learned of the Illumina methylation arrays and relearned ChIP-Seq. We were also introduced to MeDIP, bisulfite sequencing, DNase I hypersensitive assays, FAIRE-Seq, MNase-Seq and ChIA-PET. I won't write about these technologies.

So how do DNA methylation and histone modifications relate to cancer? It has long been known that hypomethylation of oncogenes and hypermethylation of tumour suppressor genes can cause cancer. In other words, oncogenes that should be methylated are not, and so are activated, while genes that suppress tumours are methylated and turned off. Since histone modifications have a large role in governing the role of a DNA sequence, aberrant modifications that turn on oncogenes and turn off tumour suppressor genes can also result in cancer formation. There is also another, indirect level: the genes important for carrying out DNA methylation (DNMT), histone acetylation (histone acetyltransferase) or deacetylation (histone deacetylase) can be mutated, thus resulting in cancer.

Encyclopedia of DNA elements (ENCODE)

ENCODE is an abbreviation of "Encyclopedia of DNA elements", a project that aims to discover and define the functional elements encoded in the human genome via multiple technologies. In this context, the term "functional element" denotes a discrete region of the genome that encodes a defined product (e.g., protein) or a reproducible biochemical signature, such as transcription or a specific chromatin structure. It is now widely appreciated that such signatures, either alone or in combination, mark genomic sequences with important functions, including exons, sites of RNA processing, and transcriptional regulatory elements such as promoters, enhancers, silencers, and insulators.

The data produced by the ENCODE project include:

  1. Identification and quantification of RNA species in whole cells and in sub-cellular compartments
  2. Mapping of protein-coding regions
  3. Delineation of chromatin and DNA accessibility and structure with nucleases and chemical probes
  4. Mapping of histone modifications and transcription factor binding sites (TFBSs) by chromatin immunoprecipitation (ChIP)
  5. Measurement of DNA methylation

To facilitate comparison and integration of data, ENCODE data production efforts have prioritised selected sets of cell types. The highest priority set (designated "Tier 1") includes two widely studied immortalised cell lines (K562 erythroleukemia cells and an EBV-immortalised B-lymphoblastoid line) and the H1 human embryonic stem cell line. A secondary priority set (Tier 2) includes HeLa-S3 cervical carcinoma cells, HepG2 hepatoblastoma cells, and primary (non-transformed) human umbilical vein endothelial cells (HUVEC), which have limited proliferation potential in culture. A third set (Tier 3) currently comprises more than 100 cell types that are being analysed in selected assays.

A major goal of ENCODE is to manually annotate all protein-coding genes, pseudogenes, and non-coding transcribed loci in the human genome and to catalog the products of transcription including splice isoforms. This annotation process involves consolidation of all evidence of transcripts (cDNA, EST sequences) and proteins from public databases, followed by building gene structures based on supporting experimental data.

Ultimately, each gene or transcript model is assigned one of three confidence levels:

  1. Level 1 includes genes validated by RT-PCR and sequencing, plus consensus pseudogenes.
  2. Level 2 includes manually annotated coding and long non-coding loci that have transcriptional evidence in EMBL/GenBank.
  3. Level 3 includes Ensembl gene predictions in regions not yet manually annotated or for which there is new transcriptional evidence.

The result of ENCODE gene annotation (termed "GENCODE") is a comprehensive catalog of transcripts and gene models.
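As a rough illustration of how these confidence levels can be used, the Python sketch below counts gene records per level in GTF-formatted lines. It assumes that each record carries an unquoted `level N;` entry in its ninth (attribute) column, which is how I understand GENCODE GTF files to be laid out; the function name and the example records in the usage are invented for illustration.

```python
import re
from collections import Counter

# Matches an unquoted "level 1;" style attribute. The \b stops it from
# matching inside longer keys such as transcript_support_level.
LEVEL_RE = re.compile(r"\blevel (\d);")


def count_genes_by_level(gtf_lines):
    """Count 'gene' records per confidence level in an iterable of GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only count gene-level records
        match = LEVEL_RE.search(fields[8])
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Made-up records, only to show the expected layout of the input
    example = [
        "##description: toy example",
        'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "toyA"; level 2;',
        'chr1\tENSEMBL\tgene\t2000\t2500\t.\t-\t.\tgene_id "G2"; gene_name "toyB"; level 3;',
        'chr1\tHAVANA\texon\t100\t300\t.\t+\t.\tgene_id "G1"; level 2;',
    ]
    print(count_genes_by_level(example))
```

For a real annotation file you would pass an open file handle instead of a list, since the function only needs something it can iterate over line by line.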

Another goal of ENCODE is to produce a comprehensive genome-wide catalog of transcribed loci that characterises the size, polyadenylation status, and subcellular compartmentalisation of all transcripts. Both polyA+ and polyA- RNAs are being analysed, and not only total whole cell RNAs but also those concentrated in the nucleus and cytosol. Long (>200nt) and short RNAs (<200nt) are being sequenced from each subcellular compartment, providing catalogs of potential miRNAs, snoRNA, promoter-associated short RNAs (PASRs) and other short cellular RNAs.

Cis-regulatory regions include diverse functional elements (e.g., promoters, enhancers, silencers, and insulators) that collectively modulate the magnitude, timing, and cell-specificity of gene expression. The ENCODE Project is using multiple approaches to identify cis-regulatory regions, including localising their characteristic chromatin signatures and identifying sites of occupancy of sequence-specific transcription factors. Human cis-regulatory regions characteristically exhibit nuclease hypersensitivity and may show increased solubility after chromatin fixation and fragmentation. Additionally, specific patterns of post-translational histone modifications have been connected with distinct classes of regions such as promoters and enhancers. Chromatin accessibility and histone modifications thus provide independent and complementary annotations of human regulatory DNA.

DNaseI hypersensitive sites (DHSs) are being mapped by two techniques: (i) capture of free DNA ends at in vivo DNaseI cleavage sites with biotinylated adapters, followed by digestion with a TypeIIS restriction enzyme to generate ~20 bp DNaseI cleavage site tags and (ii) direct sequencing of DNaseI cleavage sites at the ends of small (<300 bp) DNA fragments released by limiting treatment with DNaseI. For more information see the ENCODE user guide.

Downloading ENCODE data

Released ENCODE data can be downloaded at http://genome.ucsc.edu/ENCODE/downloads.html.

For example, if you are interested in the H3K4Me1 ChIP-Seq data, bigWig files are provided at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRegMarkH3k4me1/. You may also be interested in the H3K4Me3 data, since this provides some perspective on the H3K4Me1 data.

To convert the bigWig file to a human-readable (i.e. non-binary) format, download the bigWigToBedGraph executable.

Then convert the bigWig file to bedGraph by running:

bigWigToBedGraph wgEncodeBroadHistoneK562H3k4me1StdSig.bigWig wgEncodeBroadHistoneK562H3k4me1StdSig.bedgraph

The output:

chr1 10100 10125 0.6
chr1 10125 10225 1
chr1 10225 10250 1.36
chr1 10250 10275 2
chr1 10275 10300 3.64

This output can then be transformed into a BED file, for use with intersectBed.
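One way to do that transformation is sketched below in Python. It is only a sketch under my own assumptions: the signal value is stored in the fourth (name) column of the BED output, and the optional min_signal threshold and the function name are made up for illustration. bedGraph and BED both use 0-based, half-open coordinates, so the start and end columns are passed through unchanged.

```python
def bedgraph_to_bed(lines, min_signal=0.0):
    """Yield tab-separated BED lines (chrom, start, end, signal) from bedGraph lines."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and any track/browser/comment header lines
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        signal = float(value)
        if signal < min_signal:
            continue  # drop intervals below the (hypothetical) signal threshold
        # The signal is written into the BED name column
        yield f"{chrom}\t{start}\t{end}\t{signal:g}"


if __name__ == "__main__":
    example = [
        "chr1 10100 10125 0.6",
        "chr1 10125 10225 1",
        "chr1 10225 10250 1.36",
        "chr1 10250 10275 2",
        "chr1 10275 10300 3.64",
    ]
    for bed_line in bedgraph_to_bed(example, min_signal=1.0):
        print(bed_line)
```

With min_signal=1.0 the first interval (signal 0.6) is dropped and the other four are kept. For a whole file, pass an open file handle as `lines` and write the yielded lines to a new file.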


Mainly sourced from Wikipedia but arranged as per my train of thought.

Histones are highly alkaline proteins found in eukaryotic cell nuclei that package and order the DNA into structural units called nucleosomes. They are the chief protein components of chromatin, acting as spools around which DNA winds, and play a role in gene regulation. Histone H3 is one of the core histone proteins (the others are H2A, H2B and H4) involved in the structure of chromatin in eukaryotic cells. H3 forms part of the nucleosomes in the 'beads on a string' structure and is the most extensively modified of the five histones (the four core histones plus the linker histone H1).

Nucleosomes are the basic unit of DNA packaging in eukaryotes, consisting of a segment of DNA wound around a histone protein core. This structure is often compared to thread wrapped around a spool.

Chromatin is the combination of DNA and proteins that make up the contents of the nucleus of a cell. The primary functions of chromatin are: to package DNA into a smaller volume to fit in the cell, to strengthen the DNA to allow mitosis and meiosis and prevent DNA damage, and to control gene expression and DNA replication.

The common nomenclature of histone modifications is:

The name of the histone (e.g. H3)
The single-letter amino acid abbreviation (e.g., K for Lysine) and the amino acid position in the protein
The type of modification (Me: methyl, P: phosphate, Ac: acetyl, Ub: ubiquitin)

So H3K27Ac denotes the acetylation of the 27th residue (a lysine) from the start (i.e. the N-terminal) of the H3 protein.
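The naming scheme is regular enough to be parsed mechanically. Below is a minimal Python sketch that splits a name such as H3K27Ac or H3K4me3 into its parts. The regular expression, the function name and the handling of the trailing digit (which I read as the number of methyl groups, e.g. me3 for trimethylation) are my own assumptions and only cover the four modification types listed above.

```python
import re

# Modification codes from the nomenclature above, lower-cased for lookup
MODIFICATIONS = {"me": "methyl", "p": "phosphate", "ac": "acetyl", "ub": "ubiquitin"}

# histone name, one-letter amino acid, position, modification code, optional count
HISTONE_MARK_RE = re.compile(
    r"^(H1|H2A|H2B|H3|H4)([A-Z])(\d+)(me|ac|ub|p)(\d?)$", re.IGNORECASE
)


def parse_histone_mark(name):
    """Split a histone mark name into its components, or return None if it does not fit."""
    match = HISTONE_MARK_RE.match(name.strip())
    if match is None:
        return None
    histone, residue, position, modification, degree = match.groups()
    return {
        "histone": histone.upper(),
        "residue": residue.upper(),
        "position": int(position),
        "modification": MODIFICATIONS[modification.lower()],
        "degree": int(degree) if degree else None,
    }


if __name__ == "__main__":
    for mark in ("H3K27Ac", "H3K4me3", "H3K9ac"):
        print(mark, parse_histone_mark(mark))
```

Names that do not follow the pattern return None rather than raising an error, which makes it easy to filter a mixed list of track names.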

Histone acetyltransferases (HAT) are enzymes that acetylate conserved lysine amino acids on histone proteins by transferring an acetyl group from acetyl CoA. In general, histone acetylation is linked to transcriptional activation and associated with euchromatin. Histone modification levels and gene expression are well correlated; the levels of a single modification (H3K27ac) can be used to faithfully model gene expression (Karlic et al., 2010 PNAS).

Chemical modifications (e.g. methylation and acylation) to the histone proteins present in chromatin influence gene expression by changing how accessible the chromatin is to transcription. A specific modification of a specific histone protein is called a histone mark. The ENCODE H3K27Ac track on the UCSC Genome Browser shows the levels of enrichment of the H3K27Ac histone mark across the genome as determined by a ChIP-seq assay. The H3K27Ac histone mark is the acetylation of lysine 27 of the H3 histone protein, and it is thought to enhance transcription, possibly by blocking the spread of the repressive histone mark H3K27Me3 (from the description of the ENCODE track on the UCSC Genome Browser).

ChIP-sequencing, also known as ChIP-seq, is used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to precisely map global binding sites for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.

The ChIP process enriches specific crosslinked DNA-protein complexes using an antibody against a protein of interest.

See also:

ChIP-seq: welcome to the new frontier