GENCODE
See a summary on my blog: http://davetang.org/muse/2012/09/12/gencode/
Official website: http://www.gencodegenes.org/
Papers
The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression - http://genome.cshlp.org/content/22/9/1775.full
Files
Download version 19 of the GENCODE gtf file (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/):
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
How many lines are in the GTF (Gene Transfer Format) file, without the comments?
zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | wc -l
2619444
How many annotation sources?
zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f2 | sort | uniq -c
 354499 ENSEMBL
2264945 HAVANA
How many feature types?
zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f3 | sort | uniq -c
 723784 CDS
1196293 exon
  57820 gene
    114 Selenocysteine
  84144 start_codon
  76196 stop_codon
 196520 transcript
 284573 UTR
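The two counts above (source and feature type) can also be combined in a single pass over the file. The following is a minimal sketch, not part of the original notes; the function name `source_by_feature` is made up for illustration, and it assumes the standard tab-separated GTF layout with the source in column 2 and the feature type in column 3.

```shell
# Hypothetical helper: cross-tabulate source (column 2) by feature type (column 3).
# Reads uncompressed GTF on stdin and prints "count<TAB>source<TAB>feature".
source_by_feature() {
    awk -F'\t' '!/^#/ { n[$2 "\t" $3]++ }
                END   { for (k in n) print n[k] "\t" k }' \
        | sort -t "$(printf '\t')" -k2,2 -k3,3
}

# Usage with the file downloaded above:
# zcat gencode.v19.annotation.gtf.gz | source_by_feature
```

Summing the counts for one source, or for one feature type, should give back the corresponding number from the separate tallies.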
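Which attribute keys appear in column 9, and how often? (The count with no key comes from the semicolon that ends each line, and tag can occur more than once per line.)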
zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f9 | sed 's/;\s*/\n/g' | cut -f1 -d' ' | sort | uniq -c
2619444
 963475 ccdsid
2080417 exon_id
2080417 exon_number
2619444 gene_id
2619444 gene_name
2619444 gene_status
2619444 gene_type
2585111 havana_gene
2216975 havana_transcript
2619444 level
  46695 ont
4843866 tag
2619444 transcript_id
2619444 transcript_name
2619444 transcript_status
2619444 transcript_type
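The first count in that output has no key because every attribute column ends in a semicolon, which leaves an empty entry after splitting. A possible variant that drops the empty entries is sketched below. The function name `count_attribute_keys` is hypothetical, and the sketch assumes that no attribute value contains a semicolon.

```shell
# Hypothetical helper: count attribute keys in column 9, skipping empty entries.
# Reads uncompressed GTF on stdin. Assumes no ';' inside quoted values.
count_attribute_keys() {
    grep -v "^#" \
        | cut -f9 \
        | tr ';' '\n' \
        | sed 's/^ *//' \
        | grep -v '^$' \
        | cut -d' ' -f1 \
        | sort | uniq -c
}

# Usage with the file downloaded above:
# zcat gencode.v19.annotation.gtf.gz | count_attribute_keys
```

This version uses tr instead of the \n replacement in sed, which not every sed implementation supports.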
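How many lines per gene_type? Note that this counts every feature line (gene, transcript, exon, CDS and so on), not the number of genes: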
zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | perl -nle 'if (/gene_type\s"(.*?)"/){ print "$1"}' | sort | uniq -c | sort -k1rn
2402832 protein_coding
  70989 pseudogene
  51893 lincRNA
  41470 antisense
  13567 processed_transcript
   9165 miRNA
   6102 misc_RNA
   5748 snRNA
   4371 snoRNA
   3515 polymorphic_pseudogene
   3175 sense_intronic
   1581 rRNA
   1352 sense_overlapping
   1123 IG_V_gene
    763 TR_V_gene
    681 IG_V_pseudogene
    300 TR_J_gene
    185 IG_C_gene
    152 IG_D_gene
    100 3prime_overlapping_ncrna
     99 TR_V_pseudogene
     82 IG_J_gene
     66 Mt_tRNA
     58 TR_C_gene
     36 IG_C_pseudogene
     12 TR_D_gene
     12 TR_J_pseudogene
      9 IG_J_pseudogene
      6 Mt_rRNA
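The command above counts every line that carries a gene_type, so a gene with many exons is counted many times. To count genes instead, one option is to keep only the lines whose feature type (column 3) is gene before extracting the attribute. This is a sketch under that assumption; the function name `count_gene_types` is made up here, and sed stands in for the perl one-liner only to keep the example dependency-free.

```shell
# Hypothetical helper: count genes (not feature lines) per gene_type.
# Reads uncompressed GTF on stdin.
count_gene_types() {
    grep -v "^#" \
        | awk -F'\t' '$3 == "gene"' \
        | sed -n 's/.*gene_type "\([^"]*\)".*/\1/p' \
        | sort | uniq -c | sort -k1,1rn
}

# Usage with the file downloaded above:
# zcat gencode.v19.annotation.gtf.gz | count_gene_types
```

The counts from this version should add up to the number of gene features reported in the feature type tally.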
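How many lines per transcript_type? As above, this counts every feature line that carries the attribute, not the number of transcripts: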
zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | perl -nle 'if (/transcript_type\s"(.*?)"/){ print "$1"}' | sort | uniq -c | sort -k1rn
1851464 protein_coding
 281286 nonsense_mediated_decay
 154780 processed_transcript
 135772 retained_intron
  54584 lincRNA
  44207 antisense
  22972 processed_pseudogene
  15239 pseudogene
  11202 unprocessed_pseudogene
   9287 miRNA
   7090 transcribed_unprocessed_pseudogene
   6134 misc_RNA
   5762 snRNA
   4515 snoRNA
   3148 sense_intronic
   1718 polymorphic_pseudogene
   1589 rRNA
   1430 unitary_pseudogene
   1417 sense_overlapping
   1123 IG_V_gene
   1091 transcribed_processed_pseudogene
   1070 non_stop_decay
    763 TR_V_gene
    681 IG_V_pseudogene
    300 TR_J_gene
    185 IG_C_gene
    152 IG_D_gene
    100 3prime_overlapping_ncrna
     99 TR_V_pseudogene
     82 IG_J_gene
     66 Mt_tRNA
     58 TR_C_gene
     36 IG_C_pseudogene
     12 TR_D_gene
     12 TR_J_pseudogene
      9 IG_J_pseudogene
      6 Mt_rRNA
      3 translated_processed_pseudogene
Download the long noncoding RNA GTF file and count its transcripts by gene_type:
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.long_noncoding_RNAs.gtf.gz
zcat gencode.v19.long_noncoding_RNAs.gtf.gz | grep -v "^#" | awk '$3=="transcript" {print}' | perl -nle 'if (/gene_type\s"(.*?)"/){ print "$1"}' | sort | uniq -c | sort -k1rn
  11324 lincRNA
   9213 antisense
   2206 processed_transcript
    814 sense_intronic
    316 sense_overlapping
     25 3prime_overlapping_ncrna