Using bins when comparing genomic features

Comparing two files containing genomic features is a common task e.g. finding out whether the coordinates of your tags intersect with genes. Of course you could use intersectBed (as part of the BEDTools suite) for this purpose but here's how to do it anyway using Perl. NOTE: I hard code the length of my tags in this script as 27 bp.

The key to binning is this line of code:

my $which_bin = int($start / $bin); #where $bin = 100,000

Any genomic feature in the bed file that is less than 100,000 will be in bin 0. Anything in 100,000 to 199,999 is in bin 1 and so on. These bins are then used as hash keys.

This assumes a non-redundant bed file but can be changed by removing the last CHECK line of code.


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bam> <infile.bed>\n";
my $infile = shift or die $usage;
my $bed = shift or die $usage;

my %bed = ();
my $counter = '0';
my $bin = 100_000;

warn "Processing $bed\n";

open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
   chomp;
   ++$counter;
   #chr1    134938   134939   name  0       -
   my ($chr,$start,$end,$name,$score,$strand) = split(/\t/);
   my $which_bin = int($start / $bin);
   $bed{$chr}{$strand}{$which_bin}{$counter}->{'START'} = $start;
   $bed{$chr}{$strand}{$which_bin}{$counter}->{'END'} = $end;
}
close(IN);

warn "Finished storing $bed\n";

my $bam_entry = '0';
open(IN,'-|',"samtools view $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my ($qname,$flag,$rname,$pos,$mapq,$cigar,$mrnm,$mpos,$isize,$seq,$qual,$tag,$vtype,$value) = split;
   next if $flag == '4';
   my $strand = '+';
   my $start = $pos;
   if ($flag == '16'){
      $strand = '-';
      $start = $pos + '27';
   }
   ++$bam_entry;
   my $which_bin = int($start/$bin);
   CHECK: foreach my $entry (sort {$a <=> $b} keys %{$bed{$rname}{$strand}{$which_bin}}){
      my $s_start = $bed{$rname}{$strand}{$which_bin}{$entry}->{'START'};
      my $s_end = $bed{$rname}{$strand}{$which_bin}{$entry}->{'END'};
      #print "$entry\t$rname\t$s_start\t$s_end\n";
      if ($start >= $s_start && $start <= $s_end){
         print "$_\n";
         last CHECK;
      }
   }
   warn "Processed $bam_entry bam entries\n";
}
close(IN);

exit(0);

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.