Updated 2015 December 16th: the files for the puzzles can be downloaded from the Amazon cloud
Roughly two weeks ago I came across this excellent BEDTools tutorial and saw some puzzles or homework questions at the end of the tutorial; naturally I tweeted about it:
Oh this looks like fun; BEDTools puzzles http://t.co/UMowwdPLEf
— Dave Tang (@davetang31) December 4, 2014
I enjoy puzzles, so BEDTools puzzles is definitely my idea of fun. In this post I will go through the ten puzzles/questions, without looking at the answers initially. Let's see how I do!
We will need these files, so download them first:
curl -O https://s3.amazonaws.com/bedtools-tutorials/web/maurano.dnaseI.tgz curl -O https://s3.amazonaws.com/bedtools-tutorials/web/cpg.bed curl -O https://s3.amazonaws.com/bedtools-tutorials/web/exons.bed curl -O https://s3.amazonaws.com/bedtools-tutorials/web/gwas.bed curl -O https://s3.amazonaws.com/bedtools-tutorials/web/genome.txt curl -O https://s3.amazonaws.com/bedtools-tutorials/web/hesc.chromHmm.bed
- Create a BED file representing all of the intervals in the genome that are NOT exonic and not Promoters (based on the promoters in the hESC file).
I thought the easiest way was to create a single BED file containing the exonic and promoter regions, and use complementBed to find the complement.
cat hesc.chromHmm.bed | grep -i Promoter > promoter.bed #the cut is used to keep the first three columns cat exons.bed promoter.bed | sort -k1,1 -k2,2n | cut -f1-3 > promoter_and_exon.bed complementBed -i promoter_and_exon.bed -g genome.txt > promoter_and_exon_complement.bed cat promoter_and_exon_complement.bed | wc -l # 235090
- What is the average distance from GWAS SNPs to the closest exon? (Hint - have a look at the closest tool.)
As hinted, we should use the closestBed tool. Since we need to calculate some statistics, I suggest installing the filo package, which was also developed by Aaron. As we want to know the distance of the closest exon to a SNP, the -b parameter will be for the exons BED file.
closestBed -a gwas.bed -b exons.bed -d | cut -f11 | stats Total lines: 31550 Sum of lines: 1473798680 Ari. Mean: 46713.1118858954 Geo. Mean: undef (zero found in data) Median: 5803 Mode: 0 (N=3439) Anti-Mode: 1 (N=1) Minimum: -1 Maximum: 2442918 Variance: 17694582509.7532 StdDev: 133020.985223209
The average distance of closest exons to SNPs, is 46,713 bps.
- Count how many exons occur in each 500kb interval (“window”) in the human genome. (Hint - have a look at the makewindows tool.)
For this we first need to make 500 kb intervals of the human genome using the windowMaker tool, then use intersectBed with the -c parameter.
windowMaker -g genome.txt -w 500000 > genome_window_500000.bed intersectBed -a genome_window_500000.bed -b exons.bed -c | head chr1 0 500000 37 chr1 500000 1000000 197 chr1 1000000 1500000 477 chr1 1500000 2000000 445 chr1 2000000 2500000 209 chr1 2500000 3000000 96 chr1 3000000 3500000 83 chr1 3500000 4000000 271 chr1 4000000 4500000 9 chr1 4500000 5000000 12
In the first window, i.e. chr1:0-500000, there are 37 exons.
- Are there any exons that are completely overlapped by an enhancer? If so, how many?
We will need to create an enhancer BED file and then use the -f parameter of intersectBed, which sets the minimum overlap required.
cat hesc.chromHmm.bed | grep -i enhancer > enhancer.bed intersectBed -a exons.bed -b enhancer.bed -f 1 -u | wc -l 13746
- What fraction of the GWAS SNPs are exonic?
I would use intersectBed to report the number of overlaps, and then calculate the fraction.
intersectBed -a gwas.bed -b exons.bed -u | wc -l 1625 cat gwas.bed | wc -l 17680 bc -l<<<1625/17680 .09191176470588235294
- What fraction of the GWAS SNPs are lie in either enhancers or promoters in the hESC data we have?
For this task, I will use a feature that I didn't know existed before going through this tutorial, which is "Intersecting multiple files at once." (You will need version 2.21.0 or higher.) We can use the promoter and enhancer files that were created earlier.
intersectBed -a gwas.bed -b enhancer.bed promoter.bed -u | wc -l 1285 cat gwas.bed | wc -l 17680 bc -l<<<1285/17680 .07268099547511312217
- Create intervals representing the canonical 2bp splice sites on either side of each exon (don’t worry about excluding splice sites at the first or last exon). (Hint - have a look at the flank tool.)
I'm glad there was the hint because I've never used the flankBed tool before.
flankBed -i exons.bed -g genome.txt -l 2 -r 2 > exons_flank_2.bed head -3 exons.bed chr1 11873 12227 NR_046018_exon_0_0_chr1_11874_f 0 + chr1 12612 12721 NR_046018_exon_1_0_chr1_12613_f 0 + chr1 13220 14409 NR_046018_exon_2_0_chr1_13221_f 0 + head -6 exons_flank_2.bed chr1 11871 11873 NR_046018_exon_0_0_chr1_11874_f 0 + chr1 12227 12229 NR_046018_exon_0_0_chr1_11874_f 0 + chr1 12610 12612 NR_046018_exon_1_0_chr1_12613_f 0 + chr1 12721 12723 NR_046018_exon_1_0_chr1_12613_f 0 + chr1 13218 13220 NR_046018_exon_2_0_chr1_13221_f 0 + chr1 14409 14411 NR_046018_exon_2_0_chr1_13221_f 0 +
- What is the Jaccard statistic between CpG and hESC enhancers? Compare that to the Jaccard statistic between CpG and hESC promoters. Does the result make sense? (Hint - you will need grep).
bedtools jaccard -a cpg.bed -b enhancer.bed intersection union-intersection jaccard n_intersections 1148180 132977386 0.0086344 4969 bedtools jaccard -a cpg.bed -b promoter.bed intersection union-intersection jaccard n_intersections 15661111 53551816 0.292448 20402
I'm not sure why I needed grep for this. The results make sense because CpG islands are typically associated with promoters; the overlap is higher between CpG islands and promoters (Jaccard statistic of 0.292448) in contrast to CpG islands and enhancers (Jaccard statistic of 0.0086344).
- What would you expect the Jaccard statistic to look like if promoters were randomly distributed throughout the genome? (Hint - you will need the shuffle tool.)
For this question I will use the cpg.bed as a reference point (since we used it as a reference in question eight).
shuffleBed -i promoter.bed -g genome.txt -seed 31 | sort -k1,1 -k2,2n > promoter_shuffle.bed bedtools jaccard -a cpg.bed -b promoter.bed intersection union-intersection jaccard n_intersections 15661111 53551816 0.292448 20402 bedtools jaccard -a cpg.bed -b promoter_shuffle.bed intersection union-intersection jaccard n_intersections 329538 68533035 0.00480845 825
- Which hESC ChromHMM state (e.g., 11_Weak_Txn, 10_Txn_Elongation) represents the most number of base pairs in the genome? (Hint: you will need to use awk or perl here, as well as the groupby tool.)
For this we need to calculate the coverage of each entry in the BED file, then use the groupBy tool to sum up the regions per chromatin state.
cat hesc.chromHmm.bed |\ perl -nle '@a=split; print $a-$a,"\t$a"' |\ sort -k2 |\ groupBy -g 2 -c 1 -o sum |\ sort -k2rn 13_Heterochrom/lo 1992618522 11_Weak_Txn 503587684 10_Txn_Elongation 82134508 7_Weak_Enhancer 66789380 12_Repressed 38190271 6_Weak_Enhancer 35705628 9_Txn_Transition 25674539 8_Insulator 22467670 2_Weak_Promoter 18737021 3_Poised_Promoter 18724570 1_Active_Promoter 9908594 5_Strong_Enhancer 7139201 14_Repetitive/CNV 4079101 4_Strong_Enhancer 2648615 15_Repetitive/CNV 2342560
After consulting the answers, I saw that I got the same answers. (The answer to question ten is missing, so I could be wrong.) I should also note that the suggested answers avoided the creation of intermediate files, which is definitely more desirable. For example, for question seven:
bedtools intersect -a gwas.bed -b <(egrep "Enhancer|Promoter" hesc.chromHmm.bed) -u \ | wc -l 1285 #my solution using multiple files intersectBed -a gwas.bed -b enhancer.bed promoter.bed -u | wc -l 1285
In conclusion, I had fun and I award myself a B+.
This work is licensed under a Creative Commons
Attribution 4.0 International License.