GENCODE

From Dave's wiki

See a summary on my blog: http://davetang.org/muse/2012/09/12/gencode/

Official website: http://www.gencodegenes.org/

Papers

The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression - http://genome.cshlp.org/content/22/9/1775.full

Files

Download version 19 of the GENCODE GTF file (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/):

wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

How many lines are in the GTF (Gene Transfer Format) file, without the comments?

zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | wc -l
2619444

How many annotation sources?

zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f2 | sort | uniq -c
 354499 ENSEMBL
2264945 HAVANA

How many feature types?

zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f3 | sort | uniq -c
 723784 CDS
1196293 exon
  57820 gene
    114 Selenocysteine
  84144 start_codon
  76196 stop_codon
 196520 transcript
 284573 UTR
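
The source and feature type counts above each decompress the file once. A minimal sketch that tallies both columns in a single pass with awk; count_source_feature is a hypothetical helper name (not part of GENCODE) and reads GTF text from standard input:

```shell
# Tally (source, feature type) pairs from GTF text on stdin in one pass.
# count_source_feature is a hypothetical helper name used for illustration.
count_source_feature() {
  awk -F'\t' '
    /^#/ { next }                          # skip comment/header lines
    { pairs[$2 "\t" $3]++ }                # key on column 2 (source) and column 3 (feature)
    END { for (k in pairs) print pairs[k] "\t" k }
  ' | sort -k2,2 -k3,3                     # order by source, then feature type
}

# Usage with the file downloaded above:
#   zcat gencode.v19.annotation.gtf.gz | count_source_feature
```

Each output line is count, source, feature type, separated by tabs.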

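How many attribute types (the keys in column 9)? The blank entry in the output comes from the trailing semicolon on each line.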

zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f9 | sed 's/;\s*/\n/g' | cut -f1 -d' ' | sort | uniq -c
2619444 
 963475 ccdsid
2080417 exon_id
2080417 exon_number
2619444 gene_id
2619444 gene_name
2619444 gene_status
2619444 gene_type
2585111 havana_gene
2216975 havana_transcript
2619444 level
  46695 ont
4843866 tag
2619444 transcript_id
2619444 transcript_name
2619444 transcript_status
2619444 transcript_type
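
The empty entry in the tally above comes from the semicolon that ends every attribute column. A sketch that skips it, assuming no attribute value contains a semicolon (the same assumption the sed command makes); count_attribute_keys is a hypothetical helper name:

```shell
# Count how often each attribute key appears in column 9 of GTF text on stdin.
# count_attribute_keys is a hypothetical helper name used for illustration.
# Assumption: attribute values never contain a semicolon.
count_attribute_keys() {
  awk -F'\t' '
    /^#/ { next }                                      # skip comment/header lines
    {
      n = split($9, attr, ";")                         # one piece per key "value" pair
      for (i = 1; i <= n; i++) {
        if (split(attr[i], kv, " ") == 0) continue     # empty piece after the final semicolon
        count[kv[1]]++                                 # kv[1] is the attribute key
      }
    }
    END { for (k in count) print count[k] "\t" k }
  ' | sort -k2,2                                       # order by key name
}

# Usage with the file downloaded above:
#   zcat gencode.v19.annotation.gtf.gz | count_attribute_keys
```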

How many lines for each gene_type?

zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | perl -nle 'if (/gene_type\s"(.*?)"/){ print "$1"}' | sort | uniq -c | sort -k1rn
2402832 protein_coding
  70989 pseudogene
  51893 lincRNA
  41470 antisense
  13567 processed_transcript
   9165 miRNA
   6102 misc_RNA
   5748 snRNA
   4371 snoRNA
   3515 polymorphic_pseudogene
   3175 sense_intronic
   1581 rRNA
   1352 sense_overlapping
   1123 IG_V_gene
    763 TR_V_gene
    681 IG_V_pseudogene
    300 TR_J_gene
    185 IG_C_gene
    152 IG_D_gene
    100 3prime_overlapping_ncrna
     99 TR_V_pseudogene
     82 IG_J_gene
     66 Mt_tRNA
     58 TR_C_gene
     36 IG_C_pseudogene
     12 TR_D_gene
     12 TR_J_pseudogene
      9 IG_J_pseudogene
      6 Mt_rRNA

How many lines for each transcript_type?

zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | perl -nle 'if (/transcript_type\s"(.*?)"/){ print "$1"}' | sort | uniq -c | sort -k1rn
1851464 protein_coding
 281286 nonsense_mediated_decay
 154780 processed_transcript
 135772 retained_intron
  54584 lincRNA
  44207 antisense
  22972 processed_pseudogene
  15239 pseudogene
  11202 unprocessed_pseudogene
   9287 miRNA
   7090 transcribed_unprocessed_pseudogene
   6134 misc_RNA
   5762 snRNA
   4515 snoRNA
   3148 sense_intronic
   1718 polymorphic_pseudogene
   1589 rRNA
   1430 unitary_pseudogene
   1417 sense_overlapping
   1123 IG_V_gene
   1091 transcribed_processed_pseudogene
   1070 non_stop_decay
    763 TR_V_gene
    681 IG_V_pseudogene
    300 TR_J_gene
    185 IG_C_gene
    152 IG_D_gene
    100 3prime_overlapping_ncrna
     99 TR_V_pseudogene
     82 IG_J_gene
     66 Mt_tRNA
     58 TR_C_gene
     36 IG_C_pseudogene
     12 TR_D_gene
     12 TR_J_pseudogene
      9 IG_J_pseudogene
      6 Mt_rRNA
      3 translated_processed_pseudogene
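
The gene_type and transcript_type tallies above count every line that carries the attribute, so a gene with many exons is counted many times. A sketch for counting features instead, e.g. genes per gene_type or transcripts per transcript_type; count_attr_for_feature is a hypothetical helper and assumes attribute values contain no semicolons or spaces:

```shell
# Tally the values of one attribute key, restricted to one feature type.
# count_attr_for_feature is a hypothetical helper name used for illustration.
#   $1 = feature type in column 3 (e.g. gene, transcript)
#   $2 = attribute key in column 9 (e.g. gene_type, transcript_type)
# Assumption: attribute values contain no semicolons or spaces.
count_attr_for_feature() {
  awk -F'\t' -v feature="$1" -v key="$2" '
    /^#/ { next }                                   # skip comment/header lines
    $3 == feature {
      n = split($9, attr, ";")                      # one piece per key "value" pair
      for (i = 1; i <= n; i++) {
        if (split(attr[i], kv, " ") >= 2 && kv[1] == key) {
          gsub(/"/, "", kv[2])                      # drop the surrounding quotes
          count[kv[2]]++
          break                                     # use the first occurrence of the key only
        }
      }
    }
    END { for (v in count) print count[v] "\t" v }
  ' | sort -k1,1nr                                  # largest count first
}

# Usage with the file downloaded above:
#   zcat gencode.v19.annotation.gtf.gz | count_attr_for_feature gene gene_type
#   zcat gencode.v19.annotation.gtf.gz | count_attr_for_feature transcript transcript_type
```

Since every line carries a gene_type, the per-gene counts from the first usage line should sum to the 57820 gene lines reported earlier.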

Download the long noncoding RNAs GTF file and count its transcripts per gene_type:

wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.long_noncoding_RNAs.gtf.gz
zcat gencode.v19.long_noncoding_RNAs.gtf.gz| grep -v "^#" |
awk '$3=="transcript" {print}' |
perl -nle 'if (/gene_type\s"(.*?)"/){ print "$1"}' | sort | uniq -c | sort -k1rn
 11324 lincRNA
  9213 antisense
  2206 processed_transcript
   814 sense_intronic
   316 sense_overlapping
    25 3prime_overlapping_ncrna