BEDTools puzzles

Updated 2015 December 16th: the files for the puzzles can be downloaded from the Amazon cloud

Roughly two weeks ago I came across this excellent BEDTools tutorial and saw some puzzles or homework questions at the end of the tutorial; naturally I tweeted about it:

I enjoy puzzles, so BEDTools puzzles is definitely my idea of fun. In this post I will go through the ten puzzles/questions, without looking at the answers initially. Let's see how I do!

We will need these files, so download them first:

curl -O
curl -O
curl -O
curl -O
curl -O
curl -O

Question one

  1. Create a BED file representing all of the intervals in the genome that are NOT exonic and not Promoters (based on the promoters in the hESC file).

I thought the easiest way was to create a single BED file containing the exonic and promoter regions, and use complementBed to find the complement.

cat hesc.chromHmm.bed | grep -i Promoter > promoter.bed
#the cut is used to keep the first three columns
cat exons.bed promoter.bed | sort -k1,1 -k2,2n | cut -f1-3 > promoter_and_exon.bed
complementBed -i promoter_and_exon.bed -g genome.txt > promoter_and_exon_complement.bed
cat promoter_and_exon_complement.bed | wc -l
#  235090

Question two

  1. What is the average distance from GWAS SNPs to the closest exon? (Hint - have a look at the closest tool.)

As hinted, we should use the closestBed tool. Since we need to calculate some statistics, I suggest installing the filo package, which was also developed by Aaron. As we want to know the distance of the closest exon to a SNP, the -b parameter will be for the exons BED file.

closestBed -a gwas.bed -b exons.bed -d | cut -f11 | stats
Total lines:		31550
Sum of lines:		1473798680
Ari. Mean:		46713.1118858954
Geo. Mean:		undef (zero found in data)
Median:			5803
Mode:			0 (N=3439)
Anti-Mode:		1 (N=1)
Minimum:		-1
Maximum:		2442918
Variance:		17694582509.7532
StdDev:			133020.985223209

The average distance of closest exons to SNPs, is 46,713 bps.

Question three

  1. Count how many exons occur in each 500kb interval (“window”) in the human genome. (Hint - have a look at the makewindows tool.)

For this we first need to make 500 kb intervals of the human genome using the windowMaker tool, then use intersectBed with the -c parameter.

windowMaker -g genome.txt -w 500000 > genome_window_500000.bed
intersectBed -a genome_window_500000.bed -b exons.bed -c | head
chr1	0	500000	37
chr1	500000	1000000	197
chr1	1000000	1500000	477
chr1	1500000	2000000	445
chr1	2000000	2500000	209
chr1	2500000	3000000	96
chr1	3000000	3500000	83
chr1	3500000	4000000	271
chr1	4000000	4500000	9
chr1	4500000	5000000	12

In the first window, i.e. chr1:0-500000, there are 37 exons.

Question four

  1. Are there any exons that are completely overlapped by an enhancer? If so, how many?

We will need to create an enhancer BED file and then use the -f parameter of intersectBed, which sets the minimum overlap required.

cat hesc.chromHmm.bed | grep -i enhancer > enhancer.bed
intersectBed -a exons.bed -b enhancer.bed -f 1 -u | wc -l

Question five

  1. What fraction of the GWAS SNPs are exonic?

I would use intersectBed to report the number of overlaps, and then calculate the fraction.

intersectBed -a gwas.bed -b exons.bed -u | wc -l
cat gwas.bed | wc -l
bc -l<<<1625/17680

Question six

  1. What fraction of the GWAS SNPs are lie in either enhancers or promoters in the hESC data we have?

For this task, I will use a feature that I didn't know existed before going through this tutorial, which is "Intersecting multiple files at once." (You will need version 2.21.0 or higher.) We can use the promoter and enhancer files that were created earlier.

intersectBed -a gwas.bed -b enhancer.bed promoter.bed -u | wc -l
cat gwas.bed | wc -l
bc -l<<<1285/17680

Question seven

  1. Create intervals representing the canonical 2bp splice sites on either side of each exon (don’t worry about excluding splice sites at the first or last exon). (Hint - have a look at the flank tool.)

I'm glad there was the hint because I've never used the flankBed tool before.

flankBed -i exons.bed -g genome.txt -l 2 -r 2 > exons_flank_2.bed
head -3 exons.bed 
chr1	11873	12227	NR_046018_exon_0_0_chr1_11874_f	0	+
chr1	12612	12721	NR_046018_exon_1_0_chr1_12613_f	0	+
chr1	13220	14409	NR_046018_exon_2_0_chr1_13221_f	0	+
head -6 exons_flank_2.bed 
chr1	11871	11873	NR_046018_exon_0_0_chr1_11874_f	0	+
chr1	12227	12229	NR_046018_exon_0_0_chr1_11874_f	0	+
chr1	12610	12612	NR_046018_exon_1_0_chr1_12613_f	0	+
chr1	12721	12723	NR_046018_exon_1_0_chr1_12613_f	0	+
chr1	13218	13220	NR_046018_exon_2_0_chr1_13221_f	0	+
chr1	14409	14411	NR_046018_exon_2_0_chr1_13221_f	0	+

Question eight

  1. What is the Jaccard statistic between CpG and hESC enhancers? Compare that to the Jaccard statistic between CpG and hESC promoters. Does the result make sense? (Hint - you will need grep).
bedtools jaccard -a cpg.bed -b enhancer.bed 
intersection	union-intersection	jaccard	n_intersections
1148180	132977386	0.0086344	4969

bedtools jaccard -a cpg.bed -b promoter.bed 
intersection	union-intersection	jaccard	n_intersections
15661111	53551816	0.292448	20402

I'm not sure why I needed grep for this. The results make sense because CpG islands are typically associated with promoters; the overlap is higher between CpG islands and promoters (Jaccard statistic of 0.292448) in contrast to CpG islands and enhancers (Jaccard statistic of 0.0086344).

Question nine

  1. What would you expect the Jaccard statistic to look like if promoters were randomly distributed throughout the genome? (Hint - you will need the shuffle tool.)

For this question I will use the cpg.bed as a reference point (since we used it as a reference in question eight).

shuffleBed -i promoter.bed -g genome.txt -seed 31 | sort -k1,1 -k2,2n > promoter_shuffle.bed
bedtools jaccard -a cpg.bed -b promoter.bed 
intersection	union-intersection	jaccard	n_intersections
15661111	53551816	0.292448	20402

bedtools jaccard -a cpg.bed -b promoter_shuffle.bed 
intersection	union-intersection	jaccard	n_intersections
329538	68533035	0.00480845	825

Question ten

  1. Which hESC ChromHMM state (e.g., 11_Weak_Txn, 10_Txn_Elongation) represents the most number of base pairs in the genome? (Hint: you will need to use awk or perl here, as well as the groupby tool.)

For this we need to calculate the coverage of each entry in the BED file, then use the groupBy tool to sum up the regions per chromatin state.

cat hesc.chromHmm.bed |\
perl -nle '@a=split; print $a[2]-$a[1],"\t$a[3]"' |\
sort -k2 |\
groupBy -g 2 -c 1 -o sum |\
sort -k2rn

13_Heterochrom/lo	1992618522
11_Weak_Txn	503587684
10_Txn_Elongation	82134508
7_Weak_Enhancer	66789380
12_Repressed	38190271
6_Weak_Enhancer	35705628
9_Txn_Transition	25674539
8_Insulator	22467670
2_Weak_Promoter	18737021
3_Poised_Promoter	18724570
1_Active_Promoter	9908594
5_Strong_Enhancer	7139201
14_Repetitive/CNV	4079101
4_Strong_Enhancer	2648615
15_Repetitive/CNV	2342560


After consulting the answers, I saw that I got the same answers. (The answer to question ten is missing, so I could be wrong.) I should also note that the suggested answers avoided the creation of intermediate files, which is definitely more desirable. For example, for question seven:

bedtools intersect -a gwas.bed -b <(egrep "Enhancer|Promoter" hesc.chromHmm.bed) -u \
| wc -l

#my solution using multiple files
intersectBed -a gwas.bed -b enhancer.bed promoter.bed -u | wc -l

In conclusion, I had fun and I award myself a B+.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
3 comments Add yours
  1. these files
    is not availaible. can you give me these files???for eaxmple share it in github??Thanks

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.