Chromatin immunoprecipitation sequencing (ChIP-seq) is a high-throughput method for investigating protein-DNA interactions; it aims to determine whether specific proteins interact with specific genomic loci. The workflow starts by crosslinking DNA and protein together, usually with formaldehyde, which induces protein-DNA and protein-protein crosslinks. Importantly, these crosslinks are reversible by incubation at 70°C. Next, the crosslinked DNA-protein complexes are sheared into fragments of roughly 500 bp, usually by sonication. At this point we have "sheared DNA" and "sheared DNA crosslinked with proteins". Now comes the immunoprecipitation step, a technique that precipitates a protein antigen out of solution using an antibody that recognises that particular antigen. Crosslinking captures many different DNA-protein interactions, and we use immunoprecipitation to pull down only the protein of interest together with the DNA region it was interacting with. After immunoprecipitation, the formaldehyde crosslinks are reversed by heating and the DNA is purified and sequenced. There's a nice graphic depicting this workflow in the Wikipedia article for ChIP-seq.
Updated 2014 October 8th to include an analysis using CAGE data from ENCODE and rewrote parts of the post.
In this post I will write about a read clustering method called paraclu, which groups mapped reads into clusters. This is particularly useful when working with CAGE data: transcription is known to initiate at several different positions for the same transcript, and each of these transcription start sites (TSSs) provides information on the promoter activity of that transcript, so it is useful to cluster the TSSs together. In addition, paraclu produces clusters at different levels, so you can choose the level that you want. Furthermore, studying the clusters at different levels can reveal subtle properties of promoters; this is akin to adjusting the bin size of a histogram to see whether certain properties emerge.
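To make the idea of clustering at different levels concrete, here is a toy sketch in Python. It is not paraclu's implementation. It assumes a simplified scoring rule in which a segment scores its total tag count minus a density penalty d times its width in bp, and it finds the best segment by brute force. The function name `best_segment` and the example data are made up for illustration.

```python
def best_segment(sites, d):
    """Return (score, start, end, total) for the highest-scoring segment.

    sites: list of (position, tag_count) tuples sorted by position.
    d:     density penalty per bp (an assumed, simplified scoring rule).
    Brute force over all start/end pairs, so only suitable for small inputs.
    """
    best = None
    for i in range(len(sites)):
        total = 0
        for j in range(i, len(sites)):
            total += sites[j][1]
            width = sites[j][0] - sites[i][0]
            score = total - d * width
            if best is None or score > best[0]:
                best = (score, sites[i][0], sites[j][0], total)
    return best


# Hypothetical CAGE-like data: (genomic position, number of tags)
sites = [(100, 5), (101, 20), (103, 8), (150, 2), (152, 1)]

for d in (0.01, 0.1, 10):
    score, start, end, total = best_segment(sites, d)
    print(f"d={d}: {start}-{end} ({total} tags)")
```

A small penalty keeps everything in one broad cluster, while a large penalty shrinks the cluster down to the dominant TSS, which is the sense in which the penalty acts like a histogram bin size.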
Is there any nucleotide bias with the -40 region of RefSeqs?
Taking all hg19 RefSeqs that mapped to assembled chromosomes (36,004) and extracting the nucleotide sequences 40 bp upstream of the RefSeq gene model, I generated a sequence logo.
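Extracting the upstream sequence needs to be strand-aware, which is easy to get wrong. The sketch below shows one way to do it in Python; it assumes 0-based, half-open coordinates (as in BED or UCSC tables) and a chromosome sequence already loaded as a string. The function names are my own and not from any particular library.

```python
# Translation table used to complement a DNA sequence
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_window(tx_start, tx_end, strand, size=40):
    """Coordinates (0-based, half-open) of the `size` bp upstream of a gene model."""
    if strand == "+":
        # Upstream of a plus-strand gene lies to the left of its start
        return max(0, tx_start - size), tx_start
    if strand == "-":
        # Upstream of a minus-strand gene lies to the right of its end
        return tx_end, tx_end + size
    raise ValueError(f"unknown strand: {strand!r}")


def upstream_sequence(chrom_seq, tx_start, tx_end, strand, size=40):
    """Upstream sequence, reported 5' to 3' on the gene's own strand."""
    start, end = upstream_window(tx_start, tx_end, strand, size)
    seq = chrom_seq[start:end]
    return revcomp(seq) if strand == "-" else seq


# Toy chromosome with a gene model occupying [40, 45)
chrom = "ACGT" * 10 + "GGGGG" + "TTTTC" * 8
print(upstream_sequence(chrom, 40, 45, "+"))
print(upstream_sequence(chrom, 40, 45, "-"))
```

With this orientation, the last base of each extracted sequence is position -1 for every gene regardless of strand, so the sequences can be stacked directly for a logo.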
No obvious TATA box enrichment, which was expected since only 10-20% of genes in eukaryotes have a TATA box (perhaps at -13 to -16?). Note the enrichment of a cytosine at -1.
Then I took the sequences spanning -20 to +20 and generated the same kind of sequence logo.
Note the enrichment of purines (adenine and guanine) at the 5' UTR start (position 21).
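Observations like these can also be checked numerically rather than by eye. The sketch below tallies bases per position across equal-length sequences and reports the information content (2 minus the Shannon entropy, in bits, without the small-sample correction that logo tools usually apply) and the purine fraction. The function names and the toy sequences are made up for illustration. Note that in a window of -20 to +20 with no position 0, the 1-based position 21 corresponds to +1.

```python
import math
from collections import Counter


def position_counts(seqs):
    """Per-position base counts for a list of equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [Counter(s[i].upper() for s in seqs) for i in range(length)]


def information_content(counts):
    """2 - Shannon entropy (bits) over A, C, G, T; no small-sample correction."""
    total = sum(counts[b] for b in "ACGT")
    entropy = 0.0
    for b in "ACGT":
        if counts[b] > 0:
            p = counts[b] / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy


def purine_fraction(counts):
    """Fraction of A or G among the A, C, G, T calls at one position."""
    total = sum(counts[b] for b in "ACGT")
    return (counts["A"] + counts["G"]) / total


# Toy input; real input would be the extracted -20 to +20 sequences
seqs = ["ACGT", "ACGA", "ATGG", "ACGC"]
for pos, counts in enumerate(position_counts(seqs), start=1):
    print(pos, round(information_content(counts), 3), purine_fraction(counts))
```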