ENCODE RNA polymerase II ChIP-seq

Updated 2018 November 8th: include section on MACS2

Chromatin immunoprecipitation sequencing (ChIP-seq) is a high throughput method for investigating protein-DNA interactions and aims to determine whether specific proteins are interacting with specific genomic loci. The workflow consists of crosslinking DNA and protein together, usually via the use of formaldehyde, which induces protein-DNA and protein-protein crosslinks. Importantly, these crosslinks are reversible by incubation at 70°C. Next the crosslinked DNA-protein complexes are sheared into roughly 500 bp fragments, usually by sonication. At this point we have "sheared DNA" and "sheared DNA crosslinked with proteins". Now comes the immunoprecipitation step, which is a technique that precipitates a protein antigen out of solution using an antibody that recognises that particular antigen. The crosslinking would result in many DNA-protein interactions and we use immunoprecipitation to pull down the protein of interest with the DNA region it was interacting with. After immunoprecipitation, the formaldehyde crosslinks are reversed by heating and the DNA strands are purified and sequenced. There's a nice graphic depicting this workflow at the Wikipedia article for ChIP-seq.

RNA polymerase II (RNAP II) is an enzyme that catalyses the transcription of DNA. Over at Scitable is a nice article on RNA Transcription by RNA Polymerase. Basically, RNAP II binds to the promoter of a gene to initiate transcription; however a wide range of transcription factors are also required for it to bind and initiate transcription. Therefore if we performed a ChIP-seq experiment for RNAP II, we should observe an enrichment of DNA sequences that correspond to promoters of genes. Let's see if this is true using the ENCODE RNA polymerase II ChIP-seq data for H1-hESC cells.

Let's download some ChIP-seq data; these two groups of data used two different antibodies, namely Pol2-4H8 (ab5408) and POL2 (MMS-126R):

#H1-hESC Pol2 v041610.2 ChIP-seq Peaks Rep 1 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz
#H1-hESC Pol2 v041610.2 ChIP-seq Raw Signal Rep 1 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bigWig
#H1-hESC Pol2 v041610.2 ChIP-seq Peaks Rep 2 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz
#H1-hESC Pol2 v041610.2 ChIP-seq Raw Signal Rep 2 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102RawRep2.bigWig
#H1-hESC Pol2-4H8 v041610.2 ChIP-seq Peaks Rep 1 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz
#H1-hESC Pol2-4H8 v041610.2 ChIP-seq Raw Signal Rep 1 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep1.bigWig
#H1-hESC Pol2-4H8 v041610.2 ChIP-seq Peaks Rep 2 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz
#H1-hESC Pol2-4H8 v041610.2 ChIP-seq Raw Signal Rep 2 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep2.bigWig

What does the raw signal look like?

#bigWigToBedGraph can be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/
#convert one of the bigWig files into bedGraph
bigWigToBedGraph wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bigWig wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph

#have a look at the first few lines
head wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph
chr1	10074	10075	0.181626
chr1	10075	10076	0.290601
chr1	10076	10077	0.399577
chr1	10077	10079	0.508552
chr1	10079	10083	0.544878
chr1	10083	10084	0.581203
chr1	10084	10085	0.690178
chr1	10085	10086	10.3527
chr1	10086	10090	10.389
chr1	10090	10092	10.4616

The fourth column contains the raw signal, which according to ENCODE:

Shows the density of mapped reads on the plus and minus strands (wiggle format)

Obviously you can't have 0.181626 raw reads so it must be normalised. Let's get some statistics on one of the raw signal file:

#how many positions?
cat wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph | awk '{n= n + $3-$2} END {print n}'
27001019

#total signal
cat wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph | awk '{n=n+ (($3-$2)*$4)} END {print n}'
8.77767e+06

#the stats program can be downloaded here:
#https://github.com/arq5x/filo
#clone and compile if you don't have it
#git clone https://github.com/arq5x/filo.git
#cd filo/
#make
#mv bin/* ~/bin

#total signal using Perl and stats
cat wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph | perl -nle '@a=split; for(1..$a[2]-$a[1]){ print join("\t",@a) }' | cut -f4 | stats
Total lines:            27001019
Sum of lines:           8777668.44626473
Ari. Mean:              0.325086562335471
Geo. Mean:              0.239504165495963
Median:                 0.181626
Mode:                   0.145301 (N=9376618)
Anti-Mode:              25.6456 (N=1)
Minimum:                0.145301
Maximum:                113.226
Variance:               0.597235932726039
StdDev:                 0.77281041189029

From the stats program, we see that the fourth column does not add to 1,000,000, so the normalisation isn't simply counts per million. The arithmetic mean and median indicate that most of the 27,001,019 regions are not enriched. The signal 0.145301 occurs 9,376,618 times, so perhaps this is the normalised value of 1 raw read.

Raw signal around promoters

The beta actin gene is widely expressed and is commonly used as a control. On the hg19 genome, the coordinates of the gene are chr7:5566779-5570232 on the negative strand. Let's use Gviz package, which can read in bigWig files, to visualise the raw signal around the transcription start site of the beta actin gene:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("Gviz")
#transcript annotation package
biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")

#load libraries
library(Gviz)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

#genome
gen <- 'hg19'

#region of interest
chr <- 'chr7'
start <- 5570232 - 2000
end <- 5570232 + 2000

gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

raw_1 <-  DataTrack(range = 'wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bigWig',
                    genome = gen,
                    type = "l",
                    chromosome = chr,
                    name = "bigWig")

raw_2 <- DataTrack(range = 'wgEncodeHaibTfbsH1hescPol2V0416102RawRep2.bigWig',
                   genome = gen,
                   type = "l",
                   chromosome = chr,
                   name = "bigWig")

raw_3 <- DataTrack(range = 'wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep1.bigWig',
                   genome = gen,
                   type = "l",
                   chromosome = chr,
                   name = "bigWig")

raw_4 <- DataTrack(range = 'wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep2.bigWig',
                   genome = gen,
                   type = "l",
                   chromosome = chr,
                   name = "bigWig")

grtrack <- GeneRegionTrack(txdb,
                           genome = gen,
                           chromosome = chr,
                           name = "UCSC known genes")

plotTracks(list(itrack, gtrack, grtrack, raw_1, raw_2, raw_3, raw_4),
           from = start, to = end)

actb_pol2ENCODE RNA polymerase II ChIP-seq shows an increase in signal at the transcription starting site of the beta actin gene. I'm not sure why there are two peaks in the first raw signal track around the transcription start site.

Using the broad peak files

We have also downloaded 4 broad peak files:

ls -1 *broadPeak.gz
wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz
wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz
wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz
wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz

#check it out
zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | head
chr1	9932	10575	peak1	251	.	811.800	-1	-1
chr1	10936	26830	peak2	958	.	3100.000	-1	-1
chr1	26931	31558	peak3	958	.	3100.000	-1	-1
chr1	31777	33355	peak4	176	.	570.900	-1	-1
chr1	34204	35247	peak5	95	.	308.220	-1	-1
chr1	35538	39530	peak6	958	.	3100.000	-1	-1
chr1	39906	40386	peak7	57	.	186.770	-1	-1
chr1	40824	41267	peak8	35	.	114.810	-1	-1
chr1	132449	132847	peak9	33	.	106.780	-1	-1
chr1	133615	134050	peak10	49	.	160.730	-1	-1

#no p or q-values with each peak
zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | cut -f8-9 | sort -u
-1	-1

#statistics of the signal value column
zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | cut -f7 | stats
Total lines:		32797
Sum of lines:		20845465.4300001
Ari. Mean:		635.590615909993
Geo. Mean:		360.900606769002
Median:			282.06
Mode:			3100 (N=1345)
Anti-Mode:		100.02 (N=1)
Minimum:		100
Maximum:		3233.06
Variance:		606730.251182131
StdDev:			778.928912791232

#statistics of the peak lengths
zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | awk '{print $3-$2}' | stats
Total lines:		32797
Sum of lines:		43302957
Ari. Mean:		1320.3328658109
Geo. Mean:		1054.96441854991
Median:			981
Mode:			431 (N=53)
Anti-Mode:		164 (N=1)
Minimum:		164
Maximum:		25813
Variance:		1520563.51852621
StdDev:			1233.1113163564

Let's see how many of these broad peaks overlap known transcription start sites of RefSeq gene models:

#obtain RefSeq coordinates from UCSC
echo 'SELECT chrom,txStart,txEnd,strand FROM refGene' |
mysql -B --user=genome --host=genome-mysql.cse.ucsc.edu hg19 | gzip > refgene.tsv.gz

#check it out
zcat refgene.tsv.gz | head
chrom	txStart	txEnd	strand
chr8	6912828	6914259	-
chr2	120770603	120936697	+
chr22	35796115	35820495	+
chr5	154198051	154230213	-
chr3	148847370	148891305	+
chr5	74017028	74063196	-
chr8	144661866	144679845	-
chr7	35120898	35225774	-
chr11	27528398	27699350	+

#define TSS as -300/+100 of RefSeq gene model
zcat refgene.tsv.gz |
grep -v "^chrom" |
awk 'OFS="\t" {if ($4=="-") print $1,$3-100,$3+300; else print $1,$2-300,$2+100}' | gzip > refgene_tss.bed.gz

#how many TSSs?
zcat refgene_tss.bed.gz | wc -l
50026

#install BEDTools if you haven't already
#see here http://davetang.org/wiki2/index.php?title=BEDTools

#check how many broad peaks intersect RefSeq TSSs
for file in `ls *broadPeak.gz`; do echo $file; zcat $file | wc -l; intersectBed -a $file -b refgene_tss.bed.gz -u | wc -l; echo "---"; done
wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz
42559
11281
---
wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz
91979
14167
---
wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz
32797
10991
---
wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz
41908
12123
---

Summation of signal

I recently discovered ngsplot and the package looks perfect for plotting all ChIP-seq signal across different functional regions in the genome. Please refer to their Google code page for installation instructions. I will download the BAM files corresponding to the two replicates of ENCODE RNA polymerase II ChIP-seq data for H1-hESC.

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam.bai
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep2.bam.bai

Now simply run:

ngs.plot.r -G hg19 -R tss -C wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam -O blah.tss -T PolII -L 3000 -FL 300
#use imagemagick to convert pdfs to png
convert blah.tss.avgprof.pdf blah.tss.avgprof.png
convert blah.tss.heatmap.pdf blah.tss.heatmap.png

Summation of signal around the TSS:

blah.tss.avgprofEnrichment of PolII near the TSS.

Heatmap of all signal around the TSS:

blah.tss.heatmapBy eye it looks like there is signal at half of the known TSSs, which was something we deduced.

Calling your own peaks

If you are not satisfied with the peak files you can call your own peaks using the BAM files we downloaded above. We will use MACS2 to call peaks.

MACS2 requires Python 2.7 and if you are like me and using Python 3, you can use virtualenv to create an isolated Python environment.

pip install virtualenv
virtualenv --version
16.0.0

# create new virtual environment using the system's Python installation
virtualenv -p /usr/bin/python2.7 run_macs2
source run_macs2/bin/activate

# dependencies are not automatically installed
# if I just run: pip install MACS2
# CRITICAL:Numpy must be installed!
pip install numpy
pip install MACS2
macs2 --version
macs2 2.1.2

We'll use the subcommand callpeak to call peaks; the parameters used are:

-t = ChIP-seq treatment file
-g = Effective genome size
-n = Experiment name
–outdir = Output directory

macs2 callpeak -t wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam -g 2.7e9 -n H1hescPol2V0416102AlnRep1 --outdir rep1
macs2 callpeak -t wgEncodeHaibTfbsH1hescPol2V0416102AlnRep2.bam -g 2.7e9 -n H1hescPol2V0416102AlnRep2 --outdir rep2

You should get four files once macs2 has finished running.

ls -1 rep1
H1hescPol2V0416102AlnRep1_model.r
H1hescPol2V0416102AlnRep1_peaks.narrowPeak
H1hescPol2V0416102AlnRep1_peaks.xls
H1hescPol2V0416102AlnRep1_summits.bed

head -3 rep1/H1hescPol2V0416102AlnRep1_peaks.narrowPeak 
chr1    10061   10371   H1hescPol2V0416102AlnRep1_peak_1        140     .       3.93745 16.86583        14.03859        100
chr1    13174   13764   H1hescPol2V0416102AlnRep1_peak_2        147     .       3.31520 17.64385        14.79405        402
chr1    17064   17259   H1hescPol2V0416102AlnRep1_peak_3        53      .       2.27226 7.84569 5.38032 33

Compare our peaks with the ENCODE peaks using our summits file.

# how many summits in replicate 1?
cat rep1/H1hescPol2V0416102AlnRep1_summits.bed | wc -l
49731

# create a simple BED file from the broad peak file
zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | cut -f1-3 > a.bed
cat a.bed | wc -l
32797

# intersect
bedtools intersect -a rep1/H1hescPol2V0416102AlnRep1_summits.bed -b a.bed -u | wc -l
30563

bc -l<<<30563/32797*100
93.18840137817483306300

# how many summits in replicate 2?
cat rep2/H1hescPol2V0416102AlnRep2_summits.bed | wc -l
86964

# create a simple BED file from the broad peak file
zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz | cut -f1-3 > b.bed
cat b.bed | wc -l
41908

# intersect
bedtools intersect -a rep2/H1hescPol2V0416102AlnRep2_summits.bed -b b.bed -u | wc -l
38350

bc -l<<<38350/41908*100
91.50997422926410231900

Over 90% of the ENCODE peaks overlapped our peaks.

Lastly, remember to deactivate your environment when you're done (if you used virtualenv).

deactivate

Conclusions

The overlap between ENCODE RNA polymerase II ChIP-seq broad peaks and transcription start sites (defined as the -300/+100 region of RefSeq transcript starts), was roughly around 30%. I choose RefSeq models because many of the models have cDNA support (except for the model RefSeqs). If we only considered broad peaks that were consistent in the replicates, with respect to the peaks in one file:

#a very rough approximation
#use only the peaks in wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz
#that have overlaps in the other 3 replicates
intersectBed -a wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz -b wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz -u | wc -l
28523

#how many overlap RefSeq TSSs
intersectBed -a wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz -b wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz -u | intersectBed -a stdin -b refgene_tss.bed.gz -u | wc -l
11342

bc -l<<<11342*100/28523
39.76440065911720366020

If we only consider broad peaks from wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz that overlap peaks from the other replicates (28523), the percentage of overlap to RefSeq TSSs is higher (~40%). Using the GENCODE transcript models should give a higher overlap percentage, but that's for another day.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
One comment Add yours

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.