GENCODE

By now you should have heard about the ENCODE project. GENCODE, summarised as Encyclopædia of genes and gene variants, is a sub-project of ENCODE where the aim is to annotate all evidence-based gene features in the entire human genome with high accuracy. This includes protein coding genes and their isoforms, non coding RNAs and pseudogenes. The process to create this annotation involves (1) manual curation, (2) different computational analysis and (3) targeted experimental approaches. Putative loci can be verified by wet-lab experiments and computational predictions will be analysed manually.

The GENCODE gene sets have been used by the ENCODE consortium as well as the 1000 Genomes Project as reference gene sets.

For those wishing to download the datasets directly, visit the ftp site.

For each feature there are three different confidence levels and in verbatim from the README:

Level 1: validated

Pseudogene loci, that were predicted by the analysis-pipelines from YALE, UCSC as well as by HAVANA manual annotation from WTSI. Other transcripts, that were verified experimentally by RT-PCR and sequencing.

Level 2: manual annotation

HAVANA manual annotation from WTSI (and Ensembl annotation where it is identical to Havana). The following regions are considered "fully annotated" although they will still be updated: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, ENCODE pilot regions.

Level 3: automated annotation

ENSEMBL loci where they are different from the HAVANA annotation or where no annotation can be found.

In the latest release (13), there are these files (descriptions taken from the header of each respective file):

gencode.v13.2wayconspseudos.gtf.gz - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68) - Yale-UCSC 2-way consensus pseudogenes

gencode.v13.annotation.gtf.gz gz Archive - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68)

gencode.v13.annotation_corrected_statuses.gtf.gz - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68)

gencode.v13.long_noncoding_RNAs.gtf.gz - evidence-based annotation of the human genome (GRCh37), version 13 (Ensembl 68) - long non-coding RNAs

gencode.v13.pc_transcripts.fa.gz - nucleotide fasta file of the transcripts (89,828 fasta header lines)

gencode.v13.pc_translations.fa.gz - amino acid fasta file of the transcripts (89,828 fasta header lines)

gencode.v13.polyAs.gtf.gz - evidence-based annotation of the human genome (GRCh37) - polyA features

gencode.v13.tRNAs.gtf.gz - evidence-based annotation of the human genome (GRCh37) - tRNAs (625 tRNAs)

gencode13_GRCh37.tgz - this tarball contains another layer of files:

gencode.v13.2wayconspseudos.classes
gencode.v13.2wayconspseudos.gtf
gencode.v13.annotation.level_1_2.classes
gencode.v13.annotation.level_1_2.gtf
gencode.v13.annotation.level_3.classes
gencode.v13.annotation.level_3.gtf
gencode.v13.polyAs.classes
gencode.v13.polyAs.gtf

Understanding the gencode.v13.annotation_corrected_statuses.gtf.gz file

For information on GTF files: http://www.gencodegenes.org/gencodeformat.html

There are 2,504,987 lines in this file. The breakdown of the feature types (column 3) in this file are (where the first column shows the tallied number):

 702193 CDS
    103 Selenocysteine
 268425 UTR
1143113 exon
  55123 gene
  80249 start_codon
  72814 stop_codon
 182967 transcript

As you can see above, there are more transcripts than genes; more statistics on the data release can be found here. For example the gene DDX11L1 has one gene entry in the GTF file and several transcripts:

chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "NULL"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "NULL"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "NULL"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "NULL"; transcript_name "DDX11L1-002"; level 2; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 ENSEMBL transcript 11872 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL transcript 11874 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "NULL"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "NULL"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";

Full descriptions of different biotypes are available at http://www.gencodegenes.org/gencode_biotypes.html and can be found in the additional information column (column 9). For example a "processed_transcript" doesn't contain an ORF and a "pseudogene" has homology to proteins but generally suffer from a disrupted coding sequence and an active homologous gene can be found at another locus.

The latest hg19 RefSeq release contains 40,346 entries and the latest UCSC genes release contains 80,922 entries (both downloaded from the UCSC Table Browser). I have been using the RefSeq for many years but it is time to move onto GENCODE.

And lastly here's an anything but elegant script that converts the GTP file into a BED12 file for all transcripts:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.annotation.gtf>\n";
my $infile = shift or die $usage;

if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

my %transcript = ();
my $block_count = '0';
my $block_size = '';
my $block_start = '';
while(<IN>){
   chomp;
   next if (/^#/);
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
   next unless $type eq 'transcript' || $type eq 'exon';
   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';
   foreach my $blah (@annotation){
      my ($type,$name) = split(/\s+/,$blah);
      #if ($type eq 'transcript_name'){
      if ($type eq 'transcript_id'){
         $transcript_id = $name;
         $transcript_id =~ s/"//g;
      }
   }
   if ($transcript_id eq 'none'){
      die "No name for entry $.\n";
   }
   if (exists $transcript{$transcript_id}){
      ++$block_count;
      my $this_block_size = $end - $start + 1;
      my $this_block_start = $start - $transcript{$transcript_id}->{'START'} - 1;
      $block_size .= "$this_block_size,";
      $block_start .= "$this_block_start,";
      if ($strand eq '+'){
         if ($end == $transcript{$transcript_id}->{'END'}){
            $block_count =~ s/,$//;
            $block_size =~ s/,$//;
            $block_start =~ s/,$//;
            print join("\t",$chr,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},$transcript_id,0,$strand,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},'255,0,0',$block_count,$block_size,$block_start),"\n";
            $block_count = '0';
            $block_size = '';
            $block_start = '';
            next;
         }
      } else {
         if ($start - 1 == $transcript{$transcript_id}->{'START'}){
            $block_count =~ s/,$//;
            $block_size =~ s/,$//;
            $block_start =~ s/,$//;
            my @block_start = split(/,/,$block_start);
            my @block_size = split(/,/,$block_size);
            @block_start = reverse(@block_start);
            @block_size = reverse(@block_size);
            $block_start = join(",",@block_start);
            $block_size = join(",",@block_size);
            print join("\t",$chr,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},$transcript_id,0,$strand,$transcript{$transcript_id}->{'START'},$transcript{$transcript_id}->{'END'},'255,0,0',$block_count,$block_size,$block_start),"\n";
            $block_count = '0';
            $block_size = '';
            $block_start = '';
            next;
         }
      }
   } else {
      $transcript{$transcript_id}->{'START'} = $start - 1;
      $transcript{$transcript_id}->{'END'} = $end;
   }
}
close(IN);

exit(0);

I could successfully upload the BED file to the UCSC Genome Browser and visually double checked a handful of transcripts and they looked exactly the same as the GENCODE v12 transcript models available on the UCSC Genome Browser.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
3 comments Add yours

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.