Calculating intergenic regions

Intergenic regions are simply loci in the genome demarked by where one gene ends and another starts.

To calculate intergenic regions:

  1. First create a BED file containing the coordinates of all genes
  2. Sort this BED file by chromosome and then by the starting position
  3. Merge this BED file using mergeBed
  4. Run the script below (works only with hg19), using as input the merged BED file
#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;

#DATA used for calculating final intergenic region
#change the DATA coordinates to your genome of interest 
my %chr_end = ();
while(<DATA>){
   chomp;
   my ($chr,$end) = split(/\s+/);
   $chr_end{$chr} = $end;
}
close(DATA);

my $data = ();

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my($chr,$start,$end) = split(/\t/);
   next unless exists $chr_end{$chr};
   if (!exists $data->{$chr}){
      $data->{$chr}->[0]->{'START'} = $start;
      $data->{$chr}->[0]->{'END'} = $end;
   } else {
      my $index = scalar(@{$data->{$chr}});
      $data->{$chr}->[$index]->{'START'} = $start;
      $data->{$chr}->[$index]->{'END'} = $end;
   }
}
close(IN);

foreach my $chr (keys %{$data}){
   for(my $i=0; $i<scalar(@{$data->{$chr}}); ++$i){
      my $start = '';
      my $end = '';
      my $last_start = '';
      my $last_end = '';
      #print "$chr\t$data->{$chr}->[$i]->{START}\t$data->{$chr}->[$i]->{END}\n";
      if ($i == 0){
         $start = '1';
         $end = $data->{$chr}->[$i]->{'START'} - 1;
      } elsif ($i == scalar(@{$data->{$chr}})-1){
         $last_start = $data->{$chr}->[$i]->{'END'} + 1;
         $last_end = $chr_end{$chr};
         $start = $data->{$chr}->[$i-1]->{'END'} + 1;
         $end = $data->{$chr}->[$i]->{'START'} - 1;
      } else {
         $start = $data->{$chr}->[$i-1]->{'END'} + 1;
         $end = $data->{$chr}->[$i]->{'START'} - 1;
      }
      next if $start > $end;
      print join("\t",$chr,$start,$end),"\n";
      if ($i == scalar(@{$data->{$chr}})-1){
         print join("\t",$chr,$last_start,$last_end),"\n";
      }
   }
}

exit(0);

#hg19 chromosome ends used for calculating last intergenic region
__DATA__
chr1    249250621
chr2    243199373
chr3    198022430
chr4    191154276
chr5    180915260
chr6    171115067
chr7    159138663
chrX    155270560
chr8    146364022
chr9    141213431
chr10   135534747
chr11   135006516
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392
chr16   90354753
chr17   81195210
chr18   78077248
chr20   63025520
chrY    59373566
chr19   59128983
chr22   51304566
chr21   48129895

Once you've saved the output from the script, one way to check the results is by using intersectBed.

intersectBed -a merged.bed -b intergenic.bed

And nothing should be returned, which was the case when I performed this.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
One comment Add yours
  1. Hi,
    Your script gives an error:
    Name “main::DATA” used only once: possible typo at ./script.pl line 12.
    readline() on unopened filehandle DATA at ./script.pl line 12.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.