Updated 2018 November 8th: include section on MACS2
Chromatin immunoprecipitation sequencing (ChIP-seq) is a high throughput method for investigating protein-DNA interactions and aims to determine whether specific proteins are interacting with specific genomic loci. The workflow consists of crosslinking DNA and protein together, usually via the use of formaldehyde, which induces protein-DNA and protein-protein crosslinks. Importantly, these crosslinks are reversible by incubation at 70°C. Next the crosslinked DNA-protein complexes are sheared into roughly 500 bp fragments, usually by sonication. At this point we have "sheared DNA" and "sheared DNA crosslinked with proteins". Now comes the immunoprecipitation step, which is a technique that precipitates a protein antigen out of solution using an antibody that recognises that particular antigen. The crosslinking would result in many DNA-protein interactions and we use immunoprecipitation to pull down the protein of interest with the DNA region it was interacting with. After immunoprecipitation, the formaldehyde crosslinks are reversed by heating and the DNA strands are purified and sequenced. There's a nice graphic depicting this workflow at the Wikipedia article for ChIP-seq.
RNA polymerase II (RNAP II) is an enzyme that catalyses the transcription of DNA. Over at Scitable is a nice article on RNA Transcription by RNA Polymerase. Basically, RNAP II binds to the promoter of a gene to initiate transcription; however a wide range of transcription factors are also required for it to bind and initiate transcription. Therefore if we performed a ChIP-seq experiment for RNAP II, we should observe an enrichment of DNA sequences that correspond to promoters of genes. Let's see if this is true using the ENCODE RNA polymerase II ChIP-seq data for H1-hESC cells.
Let's download some ChIP-seq data; these two groups of data used two different antibodies, namely Pol2-4H8 (ab5408) and POL2 (MMS-126R):
#H1-hESC Pol2 v041610.2 ChIP-seq Peaks Rep 1 from ENCODE/HAIB wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz #H1-hESC Pol2 v041610.2 ChIP-seq Raw Signal Rep 1 from ENCODE/HAIB wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bigWig #H1-hESC Pol2 v041610.2 ChIP-seq Peaks Rep 2 from ENCODE/HAIB wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz #H1-hESC Pol2 v041610.2 ChIP-seq Raw Signal Rep 2 from ENCODE/HAIB wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102RawRep2.bigWig #H1-hESC Pol2-4H8 v041610.2 ChIP-seq Peaks Rep 1 from ENCODE/HAIB wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz #H1-hESC Pol2-4H8 v041610.2 ChIP-seq Raw Signal Rep 1 from ENCODE/HAIB wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep1.bigWig #H1-hESC Pol2-4H8 v041610.2 ChIP-seq Peaks Rep 2 from ENCODE/HAIB wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz #H1-hESC Pol2-4H8 v041610.2 ChIP-seq Raw Signal Rep 2 from ENCODE/HAIB wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep2.bigWig
What does the raw signal look like?
#bigWigToBedGraph can be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/ #convert one of the bigWig files into bedGraph bigWigToBedGraph wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bigWig wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph #have a look at the first few lines head wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph chr1 10074 10075 0.181626 chr1 10075 10076 0.290601 chr1 10076 10077 0.399577 chr1 10077 10079 0.508552 chr1 10079 10083 0.544878 chr1 10083 10084 0.581203 chr1 10084 10085 0.690178 chr1 10085 10086 10.3527 chr1 10086 10090 10.389 chr1 10090 10092 10.4616
The fourth column contains the raw signal, which according to ENCODE:
Shows the density of mapped reads on the plus and minus strands (wiggle format)
Obviously you can't have 0.181626 raw reads so it must be normalised. Let's get some statistics on one of the raw signal file:
#how many positions? cat wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph | awk '{n= n + $3-$2} END {print n}' 27001019 #total signal cat wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph | awk '{n=n+ (($3-$2)*$4)} END {print n}' 8.77767e+06 #the stats program can be downloaded here: #https://github.com/arq5x/filo #clone and compile if you don't have it #git clone https://github.com/arq5x/filo.git #cd filo/ #make #mv bin/* ~/bin #total signal using Perl and stats cat wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bedGraph | perl -nle '@a=split; for(1..$a[2]-$a[1]){ print join("\t",@a) }' | cut -f4 | stats Total lines: 27001019 Sum of lines: 8777668.44626473 Ari. Mean: 0.325086562335471 Geo. Mean: 0.239504165495963 Median: 0.181626 Mode: 0.145301 (N=9376618) Anti-Mode: 25.6456 (N=1) Minimum: 0.145301 Maximum: 113.226 Variance: 0.597235932726039 StdDev: 0.77281041189029
From the stats program, we see that the fourth column does not add to 1,000,000, so the normalisation isn't simply counts per million. The arithmetic mean and median indicate that most of the 27,001,019 regions are not enriched. The signal 0.145301 occurs 9,376,618 times, so perhaps this is the normalised value of 1 raw read.
Raw signal around promoters
The beta actin gene is widely expressed and is commonly used as a control. On the hg19 genome, the coordinates of the gene are chr7:5566779-5570232 on the negative strand. Let's use Gviz package, which can read in bigWig files, to visualise the raw signal around the transcription start site of the beta actin gene:
#install if necessary source("http://bioconductor.org/biocLite.R") biocLite("Gviz") #transcript annotation package biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene") #load libraries library(Gviz) library(TxDb.Hsapiens.UCSC.hg19.knownGene) #genome gen <- 'hg19' #region of interest chr <- 'chr7' start <- 5570232 - 2000 end <- 5570232 + 2000 gtrack <- GenomeAxisTrack() itrack <- IdeogramTrack(genome = gen, chromosome = chr) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene raw_1 <- DataTrack(range = 'wgEncodeHaibTfbsH1hescPol2V0416102RawRep1.bigWig', genome = gen, type = "l", chromosome = chr, name = "bigWig") raw_2 <- DataTrack(range = 'wgEncodeHaibTfbsH1hescPol2V0416102RawRep2.bigWig', genome = gen, type = "l", chromosome = chr, name = "bigWig") raw_3 <- DataTrack(range = 'wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep1.bigWig', genome = gen, type = "l", chromosome = chr, name = "bigWig") raw_4 <- DataTrack(range = 'wgEncodeHaibTfbsH1hescPol24h8V0416102RawRep2.bigWig', genome = gen, type = "l", chromosome = chr, name = "bigWig") grtrack <- GeneRegionTrack(txdb, genome = gen, chromosome = chr, name = "UCSC known genes") plotTracks(list(itrack, gtrack, grtrack, raw_1, raw_2, raw_3, raw_4), from = start, to = end)
ENCODE RNA polymerase II ChIP-seq shows an increase in signal at the transcription starting site of the beta actin gene. I'm not sure why there are two peaks in the first raw signal track around the transcription start site.
Using the broad peak files
We have also downloaded 4 broad peak files:
ls -1 *broadPeak.gz wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz #check it out zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | head chr1 9932 10575 peak1 251 . 811.800 -1 -1 chr1 10936 26830 peak2 958 . 3100.000 -1 -1 chr1 26931 31558 peak3 958 . 3100.000 -1 -1 chr1 31777 33355 peak4 176 . 570.900 -1 -1 chr1 34204 35247 peak5 95 . 308.220 -1 -1 chr1 35538 39530 peak6 958 . 3100.000 -1 -1 chr1 39906 40386 peak7 57 . 186.770 -1 -1 chr1 40824 41267 peak8 35 . 114.810 -1 -1 chr1 132449 132847 peak9 33 . 106.780 -1 -1 chr1 133615 134050 peak10 49 . 160.730 -1 -1 #no p or q-values with each peak zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | cut -f8-9 | sort -u -1 -1 #statistics of the signal value column zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | cut -f7 | stats Total lines: 32797 Sum of lines: 20845465.4300001 Ari. Mean: 635.590615909993 Geo. Mean: 360.900606769002 Median: 282.06 Mode: 3100 (N=1345) Anti-Mode: 100.02 (N=1) Minimum: 100 Maximum: 3233.06 Variance: 606730.251182131 StdDev: 778.928912791232 #statistics of the peak lengths zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | awk '{print $3-$2}' | stats Total lines: 32797 Sum of lines: 43302957 Ari. Mean: 1320.3328658109 Geo. Mean: 1054.96441854991 Median: 981 Mode: 431 (N=53) Anti-Mode: 164 (N=1) Minimum: 164 Maximum: 25813 Variance: 1520563.51852621 StdDev: 1233.1113163564
Let's see how many of these broad peaks overlap known transcription start sites of RefSeq gene models:
#obtain RefSeq coordinates from UCSC echo 'SELECT chrom,txStart,txEnd,strand FROM refGene' | mysql -B --user=genome --host=genome-mysql.cse.ucsc.edu hg19 | gzip > refgene.tsv.gz #check it out zcat refgene.tsv.gz | head chrom txStart txEnd strand chr8 6912828 6914259 - chr2 120770603 120936697 + chr22 35796115 35820495 + chr5 154198051 154230213 - chr3 148847370 148891305 + chr5 74017028 74063196 - chr8 144661866 144679845 - chr7 35120898 35225774 - chr11 27528398 27699350 + #define TSS as -300/+100 of RefSeq gene model zcat refgene.tsv.gz | grep -v "^chrom" | awk 'OFS="\t" {if ($4=="-") print $1,$3-100,$3+300; else print $1,$2-300,$2+100}' | gzip > refgene_tss.bed.gz #how many TSSs? zcat refgene_tss.bed.gz | wc -l 50026 #install BEDTools if you haven't already #see here http://davetang.org/wiki2/index.php?title=BEDTools #check how many broad peaks intersect RefSeq TSSs for file in `ls *broadPeak.gz`; do echo $file; zcat $file | wc -l; intersectBed -a $file -b refgene_tss.bed.gz -u | wc -l; echo "---"; done wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz 42559 11281 --- wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz 91979 14167 --- wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz 32797 10991 --- wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz 41908 12123 ---
Summation of signal
I recently discovered ngsplot and the package looks perfect for plotting all ChIP-seq signal across different functional regions in the genome. Please refer to their Google code page for installation instructions. I will download the BAM files corresponding to the two replicates of ENCODE RNA polymerase II ChIP-seq data for H1-hESC.
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam.bai wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep2.bam wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/wgEncodeHaibTfbsH1hescPol2V0416102AlnRep2.bam.bai
Now simply run:
ngs.plot.r -G hg19 -R tss -C wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam -O blah.tss -T PolII -L 3000 -FL 300 #use imagemagick to convert pdfs to png convert blah.tss.avgprof.pdf blah.tss.avgprof.png convert blah.tss.heatmap.pdf blah.tss.heatmap.png
Summation of signal around the TSS:
Enrichment of PolII near the TSS.
Heatmap of all signal around the TSS:
By eye it looks like there is signal at half of the known TSSs, which was something we deduced.
Calling your own peaks
If you are not satisfied with the peak files you can call your own peaks using the BAM files we downloaded above. We will use MACS2 to call peaks.
MACS2 requires Python 2.7 and if you are like me and using Python 3, you can use virtualenv to create an isolated Python environment.
pip install virtualenv virtualenv --version 16.0.0 # create new virtual environment using the system's Python installation virtualenv -p /usr/bin/python2.7 run_macs2 source run_macs2/bin/activate # dependencies are not automatically installed # if I just run: pip install MACS2 # CRITICAL:Numpy must be installed! pip install numpy pip install MACS2 macs2 --version macs2 2.1.2
We'll use the subcommand callpeak to call peaks; the parameters used are:
-t = ChIP-seq treatment file
-g = Effective genome size
-n = Experiment name
–outdir = Output directory
macs2 callpeak -t wgEncodeHaibTfbsH1hescPol2V0416102AlnRep1.bam -g 2.7e9 -n H1hescPol2V0416102AlnRep1 --outdir rep1 macs2 callpeak -t wgEncodeHaibTfbsH1hescPol2V0416102AlnRep2.bam -g 2.7e9 -n H1hescPol2V0416102AlnRep2 --outdir rep2
You should get four files once macs2 has finished running.
ls -1 rep1 H1hescPol2V0416102AlnRep1_model.r H1hescPol2V0416102AlnRep1_peaks.narrowPeak H1hescPol2V0416102AlnRep1_peaks.xls H1hescPol2V0416102AlnRep1_summits.bed head -3 rep1/H1hescPol2V0416102AlnRep1_peaks.narrowPeak chr1 10061 10371 H1hescPol2V0416102AlnRep1_peak_1 140 . 3.93745 16.86583 14.03859 100 chr1 13174 13764 H1hescPol2V0416102AlnRep1_peak_2 147 . 3.31520 17.64385 14.79405 402 chr1 17064 17259 H1hescPol2V0416102AlnRep1_peak_3 53 . 2.27226 7.84569 5.38032 33
Compare our peaks with the ENCODE peaks using our summits file.
# how many summits in replicate 1? cat rep1/H1hescPol2V0416102AlnRep1_summits.bed | wc -l 49731 # create a simple BED file from the broad peak file zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz | cut -f1-3 > a.bed cat a.bed | wc -l 32797 # intersect bedtools intersect -a rep1/H1hescPol2V0416102AlnRep1_summits.bed -b a.bed -u | wc -l 30563 bc -l<<<30563/32797*100 93.18840137817483306300 # how many summits in replicate 2? cat rep2/H1hescPol2V0416102AlnRep2_summits.bed | wc -l 86964 # create a simple BED file from the broad peak file zcat wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz | cut -f1-3 > b.bed cat b.bed | wc -l 41908 # intersect bedtools intersect -a rep2/H1hescPol2V0416102AlnRep2_summits.bed -b b.bed -u | wc -l 38350 bc -l<<<38350/41908*100 91.50997422926410231900
Over 90% of the ENCODE peaks overlapped our peaks.
Lastly, remember to deactivate your environment when you're done (if you used virtualenv).
deactivate
Conclusions
The overlap between ENCODE RNA polymerase II ChIP-seq broad peaks and transcription start sites (defined as the -300/+100 region of RefSeq transcript starts), was roughly around 30%. I choose RefSeq models because many of the models have cDNA support (except for the model RefSeqs). If we only considered broad peaks that were consistent in the replicates, with respect to the peaks in one file:
#a very rough approximation #use only the peaks in wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz #that have overlaps in the other 3 replicates intersectBed -a wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz -b wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz -u | wc -l 28523 #how many overlap RefSeq TSSs intersectBed -a wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz -b wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep1.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep2.broadPeak.gz -u | intersectBed -a stdin -b wgEncodeHaibTfbsH1hescPol2V0416102PkRep1.broadPeak.gz -u | intersectBed -a stdin -b refgene_tss.bed.gz -u | wc -l 11342 bc -l<<<11342*100/28523 39.76440065911720366020
If we only consider broad peaks from wgEncodeHaibTfbsH1hescPol24h8V0416102PkRep2.broadPeak.gz that overlap peaks from the other replicates (28523), the percentage of overlap to RefSeq TSSs is higher (~40%). Using the GENCODE transcript models should give a higher overlap percentage, but that's for another day.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Excellent, Thanks a ton!