Calculating intergenic regions

Intergenic regions are simply loci in the genome demarked by where one gene ends and another starts.

To calculate intergenic regions:

  1. First create a BED file containing the coordinates of all genes
  2. Sort this BED file by chromosome and then by the starting position
  3. Merge this BED file using mergeBed
  4. Run the script below (works only with hg19), using as input the merged BED file
#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;

#DATA used for calculating final intergenic region
#change the DATA coordinates to your genome of interest 
my %chr_end = ();
while(<DATA>){
   chomp;
   my ($chr,$end) = split(/\s+/);
   $chr_end{$chr} = $end;
}
close(DATA);

my $data = ();

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my($chr,$start,$end) = split(/\t/);
   next unless exists $chr_end{$chr};
   if (!exists $data->{$chr}){
      $data->{$chr}->[0]->{'START'} = $start;
      $data->{$chr}->[0]->{'END'} = $end;
   } else {
      my $index = scalar(@{$data->{$chr}});
      $data->{$chr}->[$index]->{'START'} = $start;
      $data->{$chr}->[$index]->{'END'} = $end;
   }
}
close(IN);

foreach my $chr (keys %{$data}){
   for(my $i=0; $i<scalar(@{$data->{$chr}}); ++$i){
      my $start = '';
      my $end = '';
      my $last_start = '';
      my $last_end = '';
      #print "$chr\t$data->{$chr}->[$i]->{START}\t$data->{$chr}->[$i]->{END}\n";
      if ($i == 0){
         $start = '1';
         $end = $data->{$chr}->[$i]->{'START'} - 1;
      } elsif ($i == scalar(@{$data->{$chr}})-1){
         $last_start = $data->{$chr}->[$i]->{'END'} + 1;
         $last_end = $chr_end{$chr};
         $start = $data->{$chr}->[$i-1]->{'END'} + 1;
         $end = $data->{$chr}->[$i]->{'START'} - 1;
      } else {
         $start = $data->{$chr}->[$i-1]->{'END'} + 1;
         $end = $data->{$chr}->[$i]->{'START'} - 1;
      }
      next if $start > $end;
      print join("\t",$chr,$start,$end),"\n";
      if ($i == scalar(@{$data->{$chr}})-1){
         print join("\t",$chr,$last_start,$last_end),"\n";
      }
   }
}

exit(0);

#hg19 chromosome ends used for calculating last intergenic region
__DATA__
chr1    249250621
chr2    243199373
chr3    198022430
chr4    191154276
chr5    180915260
chr6    171115067
chr7    159138663
chrX    155270560
chr8    146364022
chr9    141213431
chr10   135534747
chr11   135006516
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392
chr16   90354753
chr17   81195210
chr18   78077248
chr20   63025520
chrY    59373566
chr19   59128983
chr22   51304566
chr21   48129895

Once you've saved the output from the script, one way to check the results is by using intersectBed.

intersectBed -a merged.bed -b intergenic.bed

And nothing should be returned, which was the case when I performed this.

Genome mapability

I know of the genome mapability and uniqueness tracks provided by the UCSC Genome Browser but I was just interested in doing this myself for the hg19 genome.

As a test, I investigated chr22, where the base composition is broken down as:

Length of chr22 = 51,304,566
A: 9,094,775
C: 8,375,984
G: 8,369,235
T: 9,054,551
Other: 16,410,021

I wrote a script to generate 35mer reads by sliding 1 bp along the chromosome:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.fa>\n";
my $infile = shift or die $usage;

my $read_length = '35';
my $current_accession = '';
my %seq = ();

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   next if /^$/;
   if (/^>(.*)$/){
      $current_accession = $1;
      warn "Storing $current_accession\n";
   } else {
      $seq{$current_accession} .= $_;
   }
}
close(IN);

warn "Finished storing genome\n";

foreach my $chr (sort {$a cmp $b} keys %seq){
   my $length = length($seq{$chr});
   my $end = $length - $read_length;
   for (my $i = 0; $i <= $end; ++$i){
      my $read = substr($seq{$chr},$i,$read_length);
      $read = uc($read);
      next if $read =~ /N/;
      die "$read\n" unless $read =~ /[ATGC]+/;
      my $chr_start = $i + 1;
      my $chr_end = $i + $read_length;
      my $id = $chr . '_' . $chr_start . '_' . $chr_end;
      print ">$id\n$read\n";
   }
}

exit(0);

Any read that contained a "N" was removed. I then mapped these 35mers back to the hg19 genome using BWA, and here are the statistics:

Mapped: 34,894,095
Unmapped: 0
Edit distances
0 34,894,095
Multimap profile
Mapped uniquely 28,615,801
Mapped to 2 locations 1,350,653
Mapped to 3 locations 851,814
Mapped to 4 locations 388,360
Mapped to 5 locations 239,843
Mapped to 6 locations 194,604
Mapped to 7 locations 184,291
Mapped to 8 locations 149,467
Mapped to 9 locations 92,002
Mapped to 10 locations 66,698
Mapped to more than 10 location 2,760,562

28,615,801 / 34,894,095 (~82%) could be mapped uniquely.

Using the GENCODE exon list and intersectBed I characterised the list of uniquely and multi mapping reads.

350,930 / 6,278,294 (5.59%) of the multi mapping reads intersected GENCODE exons.

2,441,765 / 28,615,801 (8.53%) of the uniquely mapped reads intersected GENCODE exons; only a slighter higher percentage than multimappers.

It seems that most of chr22 contains unique stretches of nucleotides. These unique regions are not particularly enriched in exonic sequences in contrast to multimappers. Of the multi mapping reads that intersected exons, it would be interesting to see how many of these map to genes and pseudogenes, especially when we have 1,350,653 reads that mapped to two places in the genome.

And lastly by using genomeCoverageBed and by separating out the uniquely mapping reads, we can create a track that shows the unique parts of the genome.

genomeCoverageBed -ibam chr22_read_sorted_unique.bam -g ~/davetang/chrom_size/hg19_chrom_info.txt -d > result

cat result | awk '$3 != 0 {print}' | head
#chr22   16050002        1
#chr22   16050003        2
#chr22   16050004        3
#chr22   16050005        4
#chr22   16050006        5
#chr22   16050007        6
#chr22   16050008        7
#chr22   16050009        8
#chr22   16050010        9
#chr22   16050011        10

The third column, which shows the number of reads covering this position, is meaningless due to the way I slid across the chromosome. One can just change the values this into 1s and 0s, where 1 is unique and 0 is not unique.

cat result | perl -nle '@a=split; if ($a[2] != 0){ print "$a[0]\t$a[1]\t1" } else {print}'

Lastly a note about the Duke uniqueness tracks provided by the UCSC Genome Browser, which you can download here, since it's almost the same as what I showed above. For the 35mer track, here's part of the track:

chr1 68849 68859 0.333333
chr1 68859 68861 0.5
chr1 68861 68896 1
chr1 68896 69099 0.333333

I checked all the scores (fourth column) and they are either 0, 0.25, 0.333333, 0.5 and 1.

0 = mapped to more than 4 places
0.25 = mapped to 4 places
0.33 = mapped to 3 places
0.5 = mapped to 2 places
1 = mapped to 1 place

The scores are assigned to the first base of the sequence. Using the example above:

chr1 68859 68861 0.5

That should mean that chr1:68859-68894 and chr1:68860-68895 and chr1:68861-68896 all map to two places in the genome. To test this I extracted the sequence at chr1:68859-68894, which is

GAATTATAAGGCATAGCCAGGGAAGTAGTGCGAGAT

And blat returns three hits, with two best hits:

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           36     1    36    36 100.0%    19   +     110447    110482     36
browser details YourSeq           36     1    36    36 100.0%     1   +      68859     68894     36
browser details YourSeq           35     1    35    36 100.0%    15   -  102463460 102463494     35

And another test for chr1:68860-68895, which is this sequence AATTATAAGGCATAGCCAGGGAAGTAGTGCGAGATA and the blat result

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           36     1    36    36 100.0%    19   +     110448    110483     36
browser details YourSeq           36     1    36    36 100.0%     1   +      68860     68895     36
browser details YourSeq           34     1    36    36  97.3%    15   -  102463458 102463493     36

Here's a slide I made for the Duke uniqueness track using 20 bp windows:

GENCODE

By now you should have heard about the ENCODE project. GENCODE, summarised as Encyclopædia of genes and gene variants, is a sub-project of ENCODE where the aim is to annotate all evidence-based gene features in the entire human genome with high accuracy. This includes protein coding genes and their isoforms, non coding RNAs and pseudogenes. The process to create this annotation involves (1) manual curation, (2) different computational analysis and (3) targeted experimental approaches. Putative loci can be verified by wet-lab experiments and computational predictions will be analysed manually.

The GENCODE gene sets have been used by the ENCODE consortium as well as the 1000 Genomes Project as reference gene sets.

For those wishing to download the datasets directly, visit the ftp site.

For each feature there are three different confidence levels and in verbatim from the README:

Level 1: validated

Pseudogene loci, that were predicted by the analysis-pipelines from YALE, UCSC as well as by HAVANA manual annotation from WTSI. Other transcripts, that were verified experimentally by RT-PCR and sequencing.

Level 2: manual annotation

HAVANA manual annotation from WTSI (and Ensembl annotation where it is identical to Havana). The following regions are considered "fully annotated" although they will still be updated: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, ENCODE pilot regions.

Level 3: automated annotation

ENSEMBL loci where they are different from the HAVANA annotation or where no annotation can be found.

In the latest release (13), there are these files (descriptions taken from the header of each respective file):

gencode.v13.2wayconspseudos.gtf.gz - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68) - Yale-UCSC 2-way consensus pseudogenes

gencode.v13.annotation.gtf.gz gz Archive - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68)

gencode.v13.annotation_corrected_statuses.gtf.gz - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68)

gencode.v13.long_noncoding_RNAs.gtf.gz - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68) - long non-coding RNAs

gencode.v13.pc_transcripts.fa.gz - nucleotide fasta file of the transcripts (89,828 fasta header lines)

gencode.v13.pc_translations.fa.gz - amino acid fasta file of the transcripts (89,828 fasta header lines)

gencode.v13.polyAs.gtf.gz - evidence-based annotation of the human genome (GRCh37) - polyA features

gencode.v13.tRNAs.gtf.gz - evidence-based annotation of the human genome (GRCh37) - tRNAs (625 tRNAs)

gencode13_GRCh37.tgz - this tarball contains another layer of files:

gencode.v13.2wayconspseudos.classes
gencode.v13.2wayconspseudos.gtf
gencode.v13.annotation.level_1_2.classes
gencode.v13.annotation.level_1_2.gtf
gencode.v13.annotation.level_3.classes
gencode.v13.annotation.level_3.gtf
gencode.v13.polyAs.classes
gencode.v13.polyAs.gtf

Understanding the gencode.v13.annotation_corrected_statuses.gtf.gz file

For information on GTF files: http://www.gencodegenes.org/gencodeformat.html

There are 2,504,987 lines in this file. The breakdown of the feature types (column 3) in this file are (where the first column shows the tallied number):

 702193 CDS
    103 Selenocysteine
 268425 UTR
1143113 exon
  55123 gene
  80249 start_codon
  72814 stop_codon
 182967 transcript

As you can see above, there are more transcripts than genes; more statistics on the data release can be found here. For example the gene DDX11L1 has one gene entry in the GTF file and several transcripts:

chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "NULL"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "NULL"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "NULL"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "NULL"; transcript_name "DDX11L1-002"; level 2; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 ENSEMBL transcript 11872 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL transcript 11874 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "NULL"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "NULL"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";

Full descriptions of different biotypes are available at http://www.gencodegenes.org/gencode_biotypes.html and can be found in the additional information column (column 9). For example a "processed_transcript" doesn't contain an ORF and a "pseudogene" has homology to proteins but generally suffer from a disrupted coding sequence and an active homologous gene can be found at another locus.

The latest hg19 RefSeq release contains 40,346 entries and the latest UCSC genes release contains 80,922 entries (both downloaded from the UCSC Table Browser). I have been using the RefSeq for many years but it is time to move onto GENCODE.

And lastly here's an anything but elegant script that converts the GTP file into a BED12 file for all transcripts:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.annotation.gtf>\n";
my $infile = shift or die $usage;

if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

my %transcript = ();
my $block_count = '0';
my $block_size = '';
my $block_start = '';
while(<IN>){
   chomp;
   next if (/^#/);
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
   next unless $type eq 'transcript' || $type eq 'exon';
   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';
   foreach my $blah (@annotation){
      my ($type,$name) = split(/\s+/,$blah);
      #if ($type eq 'transcript_name'){
      if ($type eq 'transcript_id'){
         $transcript_id = $name;
         $transcript_id =~ s/"//g;
      }
   }
   if ($transcript_id eq 'none'){
      die "No name for entry $.\n";
   }
   if (exists $transcript{$transcript_id}){
      ++$block_count;
      my $this_block_size = $end - $start + 1;
      my $this_block_start = $start - $transcript{$transcript_id}->{'START'} - 1;
      $block_size .= "$this_block_size,";
      $block_start .= "$this_block_start,";
      if ($strand eq '+'){
         if ($end == $transcript{$transcript_id}->{'END'}){
            $block_count =~ s/,$//;
            $block_size =~ s/,$//;
            $block_start =~ s/,$//;
            print join("\t",$chr,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},$transcript_id,0,$strand,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},'255,0,0',$block_count,$block_size,$block_start),"\n";
            $block_count = '0';
            $block_size = '';
            $block_start = '';
            next;
         }
      } else {
         if ($start - 1 == $transcript{$transcript_id}->{'START'}){
            $block_count =~ s/,$//;
            $block_size =~ s/,$//;
            $block_start =~ s/,$//;
            my @block_start = split(/,/,$block_start);
            my @block_size = split(/,/,$block_size);
            @block_start = reverse(@block_start);
            @block_size = reverse(@block_size);
            $block_start = join(",",@block_start);
            $block_size = join(",",@block_size);
            print join("\t",$chr,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},$transcript_id,0,$strand,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},'255,0,0',$block_count,$block_size,$block_start),"\n";
            $block_count = '0';
            $block_size = '';
            $block_start = '';
            next;
         }
      }
   } else {
      $transcript{$transcript_id}->{'START'} = $start - 1;
      $transcript{$transcript_id}->{'END'} = $end;
   }
}
close(IN);

exit(0);

I could successfully upload the BED file to the UCSC Genome Browser and visually double checked a handful of transcripts and they looked exactly the same as the GENCODE v12 transcript models available on the UCSC Genome Browser.