intersectBed

Updated 2014 June 25th

The tool intersectBed is part of the BEDTools suite of tools and performs an intersection between two BED files. For example, given two BED files, you may be interested in finding the entries that overlap. To install the latest version of BEDTools, download the source code from GitHub and compile:

git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make clean; make all
bin/bedtools --version
#bedtools v2.20.1-4-gb877b35

BEDTools usage examples are available in the official documentation.

Getting started

When I first used BEDTools in 2010, intersectBed was a separate program that was part of the BEDTools suite. Since then, BEDTools is released as one main program called bedtools. If we look inside intersectBed, it is actually just a shell script calling bedtools:

cat intersectBed 
#!/bin/sh
${0%/*}/bedtools intersect "$@"

The $@ symbol stores all the input arguments supplied, for example:

cat blah.sh 
#!/bin/sh
echo "$@"

blah.sh -a a.bed -b b.bed
-a a.bed -b b.bed

The ${0%/*} refers to $0 using parameter expansion

cat blah.sh 
#!/bin/sh
echo ${0} "$@"

blah.sh -a a.bed -b b.bed
./blah.sh -a a.bed -b b.bed

and the %/* removes everything before the /

cat blah.sh 
#!/bin/sh
echo ${0%/*} "$@"

blah.sh -a a.bed -b b.bed
. -a a.bed -b b.bed

So in the end:

cat blah.sh 
#!/bin/sh
echo ${0%/*}/bedtools intersect "$@"

blah.sh -a a.bed -b b.bed
./bedtools intersect -a a.bed -b b.bed

I’m used to calling intersectBed, it’s exactly the same as calling bedtools intersect. When we call intersectBed without any parameters, we get the usage on STDERR. We can redirect STDERR to STDOUT, so that we can use a terminal pager:

intersectBed 2>&1 | less

Here is the entire usage:

Tool:    bedtools intersect (aka intersectBed)
Version: v2.20.1-4-gb877b35
Summary: Report overlaps between two feature files.

Usage:   bedtools intersect [OPTIONS] -a  -b 

Options: 
	-abam	The A input file is in BAM format.  Output will be BAM as well.

	-ubam	Write uncompressed BAM output. Default writes compressed BAM.

	-bed	When using BAM input (-abam), write output as BED. The default
		is to write output in BAM when using -abam.

	-wa	Write the original entry in A for each overlap.

	-wb	Write the original entry in B for each overlap.
		- Useful for knowing _what_ A overlaps. Restricted by -f and -r.

	-loj	Perform a "left outer join". That is, for each feature in A
		report each overlap with B.  If no overlaps are found, 
		report a NULL feature for B.

	-wo	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlaps restricted by -f and -r.
		  Only A features with overlap are reported.

	-wao	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlapping features restricted by -f and -r.
		  However, A features w/o overlap are also reported
		  with a NULL B feature and overlap = 0.

	-u	Write the original A entry _once_ if _any_ overlaps found in B.
		- In other words, just report the fact >=1 hit was found.
		- Overlaps restricted by -f and -r.

	-c	For each entry in A, report the number of overlaps with B.
		- Reports 0 for A entries that have no overlap with B.
		- Overlaps restricted by -f and -r.

	-v	Only report those entries in A that have _no overlaps_ with B.
		- Similar to "grep -v" (an homage).

	-f	Minimum overlap required as a fraction of A.
		- Default is 1E-9 (i.e., 1bp).
		- FLOAT (e.g. 0.50)

	-r	Require that the fraction overlap be reciprocal for A and B.
		- In other words, if -f is 0.90 and -r is used, this requires
		  that B overlap 90% of A and A _also_ overlaps 90% of B.

	-s	Require same strandedness.  That is, only report hits in B
		that overlap A on the _same_ strand.
		- By default, overlaps are reported without respect to strand.

	-S	Require different strandedness.  That is, only report hits in B
		that overlap A on the _opposite_ strand.
		- By default, overlaps are reported without respect to strand.

	-split	Treat "split" BAM or BED12 entries as distinct BED intervals.

	-sorted	Use the "chromsweep" algorithm for sorted (-k1,1 -k2,2n) input.

	-g	Provide a genome file to enforce consistent chromosome sort order
		across input files. Only applies when used with -sorted option.

	-header	Print the header from the A file prior to results.

	-nobuf	Disable buffered output. Using this option will cause each line
		of output to be printed as it is generated, rather than saved
		in a buffer. This will make printing large output files 
		noticeably slower, but can be useful in conjunction with
		other software tools and scripts that need to process one
		line of bedtools output at a time.

Notes: 
	(1) When a BAM file is used for the A file, the alignment is retained if overlaps exist,
	and exlcuded if an overlap cannot be found.  If multiple overlaps exist, they are not
	reported, as we are only testing for one or more overlaps.

Something to remember is that:

1.3.6 File B is loaded into memory
Whenever a BEDTool compares two files of features, the “B” file is loaded into memory. By contrast, the “A” file is processed line by line and compared with the features from B. Therefore to minimize memory usage, one should set the smaller of the two files as the B file.

Now, let’s do some sanity checks. Let’s test the -f parameter:

cat a.bed
#chr1    500 1000

cat b.bed
#chr1    750 1000

intersectBed -a a.bed -b b.bed -f 0.50
#chr1    750 1000

#if I increase b.bed, it should still return a result
cat a.bed
#chr1    500 1000

cat b.bed
#chr1    750 2000

intersectBed -a a.bed -b b.bed -f 0.50
#chr1    750 1000

#if I make a.bed longer, the -f 0.50 condition is not fulfilled
cat a.bed
#chr1    499 1000

cat b.bed
#chr1    750 2000

intersectBed -a a.bed -b b.bed -f 0.50
#returns nothing

So if I want to check whether a feature of interest has at least 50% of its length overlapping another feature, the feature of interest should be the -a BED file i.e. the first BED file.

Testing reciprocity (-r):

cat a.bed
#chr1    500 1000

cat b.bed
#chr1    750 1250

intersectBed -a a.bed -b b.bed -f 0.50 -r
#chr1    750 1000

cat a.bed
#chr1    500 1000

cat b.bed
#chr1    750 1251

intersectBed -a a.bed -b b.bed -f 0.50 -r
#returns nothing

intersectBed with BED12

I made this BED12 file:

chr1 1 1000 test 0 + 1 1000 0 2 100,100 0,899

So there are two blocks/exons at positions 1-100 and at 899-1000.

I then made this BAM file:

read1 0 chr1 50 37 21M * 0 0 GTATGCAGTATATACATGACT gggfcfcfcff^ffdgfgg]g XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:21
read2 0 chr1 500 37 21M * 0 0 GTATGCAGTATATACATGACT gggfcfcfcff^ffdgfgg]g XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:21

Then I ran:


#Version: 2.15.0
intersectBed -abam haha.bam -b test.bed -bed -wo

and got:

chr1 49 70 read1 37 + chr1 1 1000 test 0 + 1 1000 0 2 100,100 0,899 21
chr1 499 520 read2 37 + chr1 1 1000 test 0 + 1 1000 0 2 100,100 0,899 21

So intersectBed does not take into account the different blocks. This was posted by the developer (Aaron Quinlan) on SEQanswers:

(3) Native support for “blocked” BED features (aka BED12). Note that each block is not considered separately. BEDTools merely allows one to use BED12 files and the last 6 fields are “passed through” the tools.

So in order to intersect with blocks/exons, they should be separate entities in the BED file.

Other tips

#standard in / stream
cat a.bed | intersectBed -a stdin -b b.bed

#this is useful when you want to perform two intersect steps
intersectBed -a first.bed -b second.bed | intersectBed -a stdin -b third.bed

Lastly, below is what Sean Eddy has to say about BEDTools:

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
11 comments Add yours
  1. Hello,
    I need some kind help please, as i am a beginner in learning BEDTOOLS.
    i have to extract SNPs from my VCF file based on GFF file, i mean there are some sequence IDs (ID=comp1207_c1_seq1.mrna1.cds5;Name=comp1207_c1_seq1;Parent=comp1207_c1_seq1.mrna1) in my GFF file which also contain the word “cds” but the same ID is also in VCF file (comp1207_c0_seq1 432 . A T 62.44 . AC=1;AF=0.50;AN=2;BaseQRankSum=0.135;DP=12;Dels=0.00;FS=2.533;HRun=4;HaplotypeScore=0.0000;MQ=40.7) BUT without this word “CDS”, so by using intersectBed i have to locate and extract all these IDs with CDS from VCF file, i dont know how to do this,i tried these commands but they did not work.
    tahir@aspseq:~$ intersectBed -a 26-Gper.vcf -b codes.gff
    tahir@aspseq:~$ intersectBed -a codes.gff -b 26-Gper.vcf

      1. Thankyou dave,
        I thought about it but its written there that intersectBED also supports BAM, GFF,VCF formats also, so do you still think that have to convert my GFF and VCF files into BED?.

  2. anyway i have converted my VCF and GFF files into BED files but still got nothing through
    intersectBed -a 349.2.bed -b codes.bed> got.bed

    1. It’s a bit hard to troubleshoot over the web like this. Could you make available the files somewhere (e.g. via Dropbox share or a personal server) and email (me @ davetang.org) the link. I’ll take a look.

  3. Thankyou Dave, sure i will send you my files but just wait untill i finish some other urgent tasks. just hold for a while.
    Thanx again.

  4. Hej Dave
    sorry for being late but i have got the solution, infact i forgot to remove
    the all next columns after first three columns from BED files, whcih
    intersectBEd was not able to read and intersect, anyway i have another
    short problem, if you send me your personal email adress, i can send you
    the Qs and file details.
    kind regards.

  5. Hi Dave,

    is there a quicker way if we have single base positions e.g. in case of methylation and a set of intervals and want to calculate the average coverage in that region. I was thinking about using binary search for that but I was wondering if it can be done quickly or easily with bedtools or genomic ranges and how.

    here is the input interval file e.g. :-

    Gm01 12884 12933
    Gm01 89985 90016
    Gm01 116973 117199
    Gm01 126226 126227
    Gm01 131876 131993
    Gm01 200358 200753
    Gm01 236785 237014
    Gm01 341452 341895
    Gm01 347664 347676
    Gm01 357195 357222

    and here is the methylation file :-

    Gm01 42 W82_RH25_R1 + CG 3 3
    Gm01 43 W82_RH25_R1 – CG 0 1
    Gm01 170 W82_RH25_R1 + CG 26 30
    Gm01 171 W82_RH25_R1 – CG 13 18
    Gm01 192 W82_RH25_R1 + CG 27 31
    Gm01 193 W82_RH25_R1 – CG 13 15
    Gm01 199 W82_RH25_R1 + CG 27 30
    Gm01 200 W82_RH25_R1 – CG 10 12
    Gm01 301 W82_RH25_R1 + CG 14 16
    Gm01 302 W82_RH25_R1 – CG 14 15
    Gm01 326 W82_RH25_R1 + CG 8 12
    Gm01 327 W82_RH25_R1 – CG 11 12
    Gm01 340 W82_RH25_R1 + CG 7 9

    last column is coverage and I want to calculate average coverage over a region is there any easier way to do that.

    regards
    Saad

    1. Hi Saad,

      yeah it shouldn’t be hard. First add an extra column in your methylation file; the third column is the second column + 1. So if a.bed is your interval file and b.bed is the methylation file, do something like this (untested):

      intersectBed -a a.bed -b b.bed -wo | awk '{print $1,$2,$3,$11}' | groupBy -g 1,2,3 -c 4 -o mean

      Cheers,

      Dave

  6. Why the following lines don’t give the same result?

    intersectBed -a first.bed -b second.bed | intersectBed -a stdin -b third.bed | wc -l
    intersectBed -a third.bed -b second.bed | intersectBed -a stdin -b first.bed | wc -l
    intersectBed -a second.bed -b third.bed | intersectBed -a stdin -b first.bed | wc -l

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.