ENCODE RNA polymerase II ChIP-seq

Chromatin immunoprecipitation sequencing (ChIP-seq) is a high throughput method for investigating protein-DNA interactions and aims to determine whether specific proteins are interacting with specific genomic loci. The workflow consists of crosslinking DNA and protein together, usually via the use of formaldehyde, which induces protein-DNA and protein-protein crosslinks. Importantly, these crosslinks are reversible by incubation at 70°C. Next the crosslinked DNA-protein complexes are sheared into roughly 500 bp fragments, usually by sonication. At this point we have "sheared DNA" and "sheared DNA crosslinked with proteins". Now comes the immunoprecipitation step, which is a technique that precipitates a protein antigen out of solution using an antibody that recognises that particular antigen. The crosslinking would result in many DNA-protein interactions and we use immunoprecipitation to pull down the protein of interest with the DNA region it was interacting with. After immunoprecipitation, the formaldehyde crosslinks are reversed by heating and the DNA strands are purified and sequenced. There's a nice graphic depicting this workflow at the Wikipedia article for ChIP-seq.

RNA polymerase II (RNAP II) is an enzyme that catalyses the transcription of DNA. Over at Scitable is a nice article on RNA Transcription by RNA Polymerase. Basically, RNAP II binds to the promoter of a gene to initiate transcription; however a wide range of transcription factors are also required for it to bind and initiate transcription. Therefore if we performed a ChIP-seq experiment for RNAP II, we should observe an enrichment of DNA sequences that correspond to promoters of genes. Let's see if this is true using the ENCODE RNA polymerase II ChIP-seq data for H1-hESC cells.

Let's download some ChIP-seq data; these two groups of data used two different antibodies, namely Pol2-4H8 (ab5408) and POL2 (MMS-126R):

#H1-hESC Pol2 v041610.2 ChIP-seq Peaks Rep 1 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz
#H1-hESC Pol2 v041610.2 ChIP-seq Raw Signal Rep 1 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bigWig
#H1-hESC Pol2 v041610.2 ChIP-seq Peaks Rep 2 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz
#H1-hESC Pol2 v041610.2 ChIP-seq Raw Signal Rep 2 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102RawRep2.bigWig
#H1-hESC Pol2-4H8 v041610.2 ChIP-seq Peaks Rep 1 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz
#H1-hESC Pol2-4H8 v041610.2 ChIP-seq Raw Signal Rep 1 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep1.bigWig
#H1-hESC Pol2-4H8 v041610.2 ChIP-seq Peaks Rep 2 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz
#H1-hESC Pol2-4H8 v041610.2 ChIP-seq Raw Signal Rep 2 from ENCODE/HAIB
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep2.bigWig

What does the raw signal look like?

#bigWigToBedGraph can be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/
#convert one of the bigWig files into bedGraph
bigWigToBedGraph wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bigWig wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph

#have a look at the first few lines
head wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph
chr1	10074	10075	0.181626
chr1	10075	10076	0.290601
chr1	10076	10077	0.399577
chr1	10077	10079	0.508552
chr1	10079	10083	0.544878
chr1	10083	10084	0.581203
chr1	10084	10085	0.690178
chr1	10085	10086	10.3527
chr1	10086	10090	10.389
chr1	10090	10092	10.4616

The fourth column contains the raw signal, which according to ENCODE:

Shows the density of mapped reads on the plus and minus strands (wiggle format)

Obviously you can't have 0.181626 raw reads so it must be normalised. Let's get some statistics on one of the raw signal file:

#how many positions?
cat wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph | awk '{n= n + $3-$2} END {print n}'

#total signal
cat wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph | awk '{n=n+ (($3-$2)*$4)} END {print n}'

#the stats program can be downloaded here:
#clone and compile if you don't have it
#git clone https://github.com/arq5x/filo.git
#cd filo/
#mv bin/* ~/bin

#total signal using Perl and stats
cat wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph | perl -nle '@a=split; for(1..$a[2]-$a[1]){ print join("\t",@a) }' | cut -f4 | stats
Total lines:            27001019
Sum of lines:           8777668.44626473
Ari. Mean:              0.325086562335471
Geo. Mean:              0.239504165495963
Median:                 0.181626
Mode:                   0.145301 (N=9376618)
Anti-Mode:              25.6456 (N=1)
Minimum:                0.145301
Maximum:                113.226
Variance:               0.597235932726039
StdDev:                 0.77281041189029

From the stats program, we see that the fourth column does not add to 1,000,000, so the normalisation isn't simply counts per million. The arithmetic mean and median indicate that most of the 27,001,019 regions are not enriched. The signal 0.145301 occurs 9,376,618 times, so perhaps this is the normalised value of 1 raw read.

Raw signal around promoters

The beta actin gene is widely expressed and is commonly used as a control. On the hg19 genome, the coordinates of the gene are chr7:5566779-5570232 on the negative strand. Let's use Gviz package, which can read in bigWig files, to visualise the raw signal around the transcription start site of the beta actin gene:

#install if necessary
#transcript annotation package

#load libraries

gen <- 'hg19'

#region of interest
chr <- 'chr7'
start <- 5570232 - 2000
end <- 5570232 + 2000

gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

raw_1 <-  DataTrack(range = 'wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bigWig',
                    genome = gen,
                    type = "l",
                    chromosome = chr,
                    name = "bigWig")

raw_2 <- DataTrack(range = 'wgEncodeHaibTfbsH1hescPol2V0416102RawRep2.bigWig',
                   genome = gen,
                   type = "l",
                   chromosome = chr,
                   name = "bigWig")

raw_3 <- DataTrack(range = 'wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep1.bigWig',
                   genome = gen,
                   type = "l",
                   chromosome = chr,
                   name = "bigWig")

raw_4 <- DataTrack(range = 'wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep2.bigWig',
                   genome = gen,
                   type = "l",
                   chromosome = chr,
                   name = "bigWig")

grtrack <- GeneRegionTrack(txdb,
                           genome = gen,
                           chromosome = chr,
                           name = "UCSC known genes")

plotTracks(list(itrack, gtrack, grtrack, raw_1, raw_2, raw_3, raw_4),
           from = start, to = end)

actb_pol2ENCODE RNA polymerase II ChIP-seq shows an increase in signal at the transcription starting site of the beta actin gene. I'm not sure why there are two peaks in the first raw signal track around the transcription start site.

Using the broad peak files

We have also downloaded 4 broad peak files:

ls -1 *broadPeak.gz

#check it out
zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | head
chr1	9932	10575	peak1	251	.	811.800	-1	-1
chr1	10936	26830	peak2	958	.	3100.000	-1	-1
chr1	26931	31558	peak3	958	.	3100.000	-1	-1
chr1	31777	33355	peak4	176	.	570.900	-1	-1
chr1	34204	35247	peak5	95	.	308.220	-1	-1
chr1	35538	39530	peak6	958	.	3100.000	-1	-1
chr1	39906	40386	peak7	57	.	186.770	-1	-1
chr1	40824	41267	peak8	35	.	114.810	-1	-1
chr1	132449	132847	peak9	33	.	106.780	-1	-1
chr1	133615	134050	peak10	49	.	160.730	-1	-1

#no p or q-values with each peak
zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | cut -f8-9 | sort -u
-1	-1

#statistics of the signal value column
zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | cut -f7 | stats
Total lines:		32797
Sum of lines:		20845465.4300001
Ari. Mean:		635.590615909993
Geo. Mean:		360.900606769002
Median:			282.06
Mode:			3100 (N=1345)
Anti-Mode:		100.02 (N=1)
Minimum:		100
Maximum:		3233.06
Variance:		606730.251182131
StdDev:			778.928912791232

#statistics of the peak lengths
zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | awk '{print $3-$2}' | stats
Total lines:		32797
Sum of lines:		43302957
Ari. Mean:		1320.3328658109
Geo. Mean:		1054.96441854991
Median:			981
Mode:			431 (N=53)
Anti-Mode:		164 (N=1)
Minimum:		164
Maximum:		25813
Variance:		1520563.51852621
StdDev:			1233.1113163564

Let's see how many of these broad peaks overlap known transcription start sites of RefSeq gene models:

#obtain RefSeq coordinates from UCSC
echo 'SELECT chrom,txStart,txEnd,strand FROM refGene' |
mysql -B --user=genome --host=genome-mysql.cse.ucsc.edu hg19 | gzip > refgene.tsv.gz

#check it out
zcat refgene.tsv.gz | head
chrom	txStart	txEnd	strand
chr8	6912828	6914259	-
chr2	120770603	120936697	+
chr22	35796115	35820495	+
chr5	154198051	154230213	-
chr3	148847370	148891305	+
chr5	74017028	74063196	-
chr8	144661866	144679845	-
chr7	35120898	35225774	-
chr11	27528398	27699350	+

#define TSS as -300/+100 of RefSeq gene model
zcat refgene.tsv.gz |
grep -v "^chrom" |
awk 'OFS="\t" {if ($4=="-") print $1,$3-100,$3+300; else print $1,$2-300,$2+100}' | gzip > refgene_tss.bed.gz

#how many TSSs?
zcat refgene_tss.bed.gz | wc -l

#install BEDTools if you haven't already
#see here http://davetang.org/wiki2/index.php?title=BEDTools

#check how many broad peaks intersect RefSeq TSSs
for file in `ls *broadPeak.gz`; do echo $file; zcat $file | wc -l; intersectBed -a $file -b refgene_tss.bed.gz -u | wc -l; echo "---"; done

Summation of signal

I recently discovered ngsplot and the package looks perfect for plotting all ChIP-seq signal across different functional regions in the genome. Please refer to their Google code page for installation instructions. I will download the BAM files corresponding to the two replicates of ENCODE RNA polymerase II ChIP-seq data for H1-hESC.

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam.bai
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep2.bam.bai

Now simply run:

ngs.plot.r -G hg19 -R tss -C wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam -O blah.tss -T PolII -L 3000 -FL 300
#use imagemagick to convert pdfs to png
convert blah.tss.avgprof.pdf blah.tss.avgprof.png
convert blah.tss.heatmap.pdf blah.tss.heatmap.png

Summation of signal around the TSS:

blah.tss.avgprofEnrichment of PolII near the TSS.

Heatmap of all signal around the TSS:

blah.tss.heatmapBy eye it looks like there is signal at half of the known TSSs, which was something we deduced.


The overlap between ENCODE RNA polymerase II ChIP-seq broad peaks and transcription start sites (defined as the -300/+100 region of RefSeq transcript starts), was roughly around 30%. I choose RefSeq models because many of the models have cDNA support (except for the model RefSeqs). If we only considered broad peaks that were consistent in the replicates, with respect to the peaks in one file:

#a very rough approximation
#use only the peaks in wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz
#that have overlaps in the other 3 replicates
intersectBed -a wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz -b wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz -u | wc -l

#how many overlap RefSeq TSSs
intersectBed -a wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz -b wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz -u | intersectBed -a stdin -b refgene_tss.bed.gz -u | wc -l

bc -l<<<11342*100/28523

If we only consider broad peaks from wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz that overlap peaks from the other replicates (28523), the percentage of overlap to RefSeq TSSs is higher (~40%). Using the GENCODE transcript models should give a higher overlap percentage, but that's for another day.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
One comment Add yours

Leave a Reply

Your email address will not be published. Required fields are marked *