Find regions of the mouse genome devoid of any annotation (ESTs, mRNA, repeats, RefSeq and UCSC genes).
Annotation tracks downloaded using the table browser feature of the UCSC Genome Browser. Chromosome sizes of mm9 downloaded from here.
Code for finding regions of 10kb devoid of any annotation.
#!/usr/bin/perl
use strict;
use warnings;
my $bin = 10_000;
my $chrom_file = '/home/davetang/davetang/chrom_size/mm9_chrom_info.txt';
my %chr = ();
open(IN,'<',$chrom_file) || die "Could not open $chrom_file: $!\n";
while(<IN>){
chomp;
#chr1 197195432 /gbdb/mm9/mm9.2bit
my ($chr,$size,$junk) = split(/\s+/);
$chr{$chr} = $size;
}
close(IN);
my %feature = ();
my @file = qw/mm9_est_gene.bed.gz
mm9_mrna_gene.bed.gz
mm9_refseq_gene.bed.gz
mm9_repeatmasker_gene.bed.gz
mm9_ucsc_gene.bed.gz/;
foreach my $infile (@file){
open(IN,'-|',"zcat $infile") || die "Could not open $infile: $!\n";
while(<IN>){
chomp;
#chr1 134212712 134221648 BB62599
my $which_bin = '';
my ($chr,$start,$end,@junk) = split(/\t/);
for (my $i = $start; $i <= $end; $i+=$bin){
$which_bin = int($i / $bin);
$feature{$chr}{$which_bin} = '1';
}
$which_bin = int($end / $bin);
$feature{$chr}{$which_bin} = '1';
}
close(IN);
}
foreach my $chr (keys %chr){
my $remainder = int($chr{$chr} / $bin);
foreach (0 .. $remainder){
next if exists($feature{$chr}{$_});
my $start = $_ * $bin;
my $end = $start + 9999;
my $id = $chr . '_' . $start . '_' . $end;
print join("\t",$chr,$start,$end,$id),"\n";
}
}
exit(0);
__END__
In the mm9 genome I found 9,634 10kb regions with no annotation (corresponding to 96,340,000 bp of the genome with no annotation). A few lines of the output (bed format):
chr1 87310000 87319999 chr1_87310000_87319999
chr1 87320000 87329999 chr1_87320000_87329999
chr1 87330000 87339999 chr1_87330000_87339999
chr1 184930000 184939999 chr1_184930000_184939999
chr1 184940000 184949999 chr1_184940000_184949999

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Interesting, I just used this.
But then I had a thought. I feel like I could do something similar using bedtools. Create a bed file with each chromosome as a bed file, and then intersect with various gene lists to only keep non-coding regions, and then filter the list by region size.
Regardless, thanks for sharing.
Hi Sam,
yes, please use BEDTools instead! For example: https://github.com/davetang/defining_genomic_regions.
By the way, how did you find this post? I’m curious as you are the first comment in just over 10 years!
Cheers,
Dave
Hi Dave,
Haha, I was on the hunt for large gene deserts and google brought me to it. Kudos to you as well for continuing on with the blog. And hey, some code from your PhD still used/inspiring ideas!
Best,
Sam