The Gene Transfer Format (GTF) is a refinement of the General Feature Format (GFF). A GFF file has nine columns:
seqname | The name of the sequence; must be a chromosome or scaffold. |
source | The program that generated this feature. |
feature | The name of this type of feature, e.g. "CDS", "start_codon", "stop_codon", and "exon" |
start | The starting position of the feature in the sequence; the first base is numbered 1. |
end | The ending position of the feature (inclusive). |
score | A score between 0 and 1000. |
strand | Valid entries include "+", "-", or ".". |
frame | If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be ".". |
group | All lines with the same group are linked together into a single item. |
The first eight fields in a GTF file are the same as GFF but the group field has been expanded into a list of attributes, where each attribute consists of a type/value pair. Attributes must end in a semi-colon and be separated from any following attribute by exactly one space. The attribute list must begin with the two mandatory attributes:
- gene_id value - A globally unique identifier for the genomic source of the sequence.
- transcript_id value - A globally unique identifier for the predicted transcript.
To get started, download an example GTF file. I've started working with Arabidopsis thaliana, so I'll use a GTF file corresponding to Arabidopsis thaliana genes.
wget -c ftp://ftp.ensemblgenomes.org/pub/release-36/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.36.gtf.gz gunzip Arabidopsis_thaliana.TAIR10.36.gtf.gz head Arabidopsis_thaliana.TAIR10.36.gtf #!genome-build TAIR10 #!genome-version TAIR10 #!genome-date 2010-09 #!genome-build-accession GCA_000001735.1 #!genebuild-last-updated 2010-09 1 araport11 gene 3631 5899 . + . gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; 1 araport11 transcript 3631 5899 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; 1 araport11 exon 3631 3913 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1"; 1 araport11 CDS 3760 3913 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; protein_version "1"; 1 araport11 start_codon 3760 3762 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; cat Arabidopsis_thaliana.TAIR10.36.gtf | grep -v "^#" | wc -l 888095
The refGenome package provides functions for reading and summarising GTF files.
install.packages("refGenome") library(refGenome) # change to directory where you downloaded GTF file setwd("~/muse/refgenome") # create ensemblGenome object for storing Ensembl genomic annotation data ens <- ensemblGenome() # read GTF file into ensemblGenome object read.gtf(ens, "Arabidopsis_thaliana.TAIR10.36.gtf") [read.gtf.refGenome] Reading file 'Arabidopsis_thaliana.TAIR10.36.gtf'. [GTF] 888100 lines processed. [read.gtf.refGenome] Extracting genes table. [read.gtf.refGenome] Found 32,833 gene records. [read.gtf.refGenome] Finished. class(ens) [1] "ensemblGenome" attr(,"package") [1] "refGenome"
We'll use some functions provided by the refGenome package to create summaries.
# counts all annotations on each seqname tableSeqids(ens) 1 2 3 4 5 Mt Pt 229818 132448 165053 131565 195121 690 567 # returns all annotations on mitochondria extractSeqids(ens, 'Mt') Object of class 'ensemblGenome' with 690 rows and 27 columns. id seqid source feature start end score strand frame 2" 4" 5" 3" 8" H3" 886555 886555 Mt araport11 transcript 273 734 . - . <NA> <NA> <NA> <NA> <NA> <NA> 886556 886556 Mt araport11 exon 273 734 . - . <NA> <NA> <NA> <NA> <NA> <NA> 886557 886557 Mt araport11 CDS 276 734 . - 0 <NA> <NA> <NA> <NA> <NA> <NA> 886558 886558 Mt araport11 start_codon 732 734 . - 0 <NA> <NA> <NA> <NA> <NA> <NA> 886559 886559 Mt araport11 stop_codon 273 275 . - 0 <NA> <NA> <NA> <NA> <NA> <NA> 886561 886561 Mt araport11 transcript 8848 11415 . - . <NA> <NA> <NA> <NA> <NA> <NA> protein_version protein_id exon_id gene_name gene_biotype exon_number 1" 886555 <NA> <NA> <NA> ORF153A protein_coding <NA> <NA> 886556 <NA> <NA> ATMG00010.1.exon1 ORF153A protein_coding 1 <NA> 886557 1 ATMG00010.1 <NA> ORF153A protein_coding 1 <NA> 886558 <NA> <NA> <NA> ORF153A protein_coding 1 <NA> 886559 <NA> <NA> <NA> ORF153A protein_coding 1 <NA> 886561 <NA> <NA> <NA> RRN26 rRNA <NA> <NA> transcript_biotype transcript_source transcript_id gene_source gene_id 886555 protein_coding araport11 ATMG00010.1 araport11 ATMG00010 886556 protein_coding araport11 ATMG00010.1 araport11 ATMG00010 886557 protein_coding araport11 ATMG00010.1 araport11 ATMG00010 886558 protein_coding araport11 ATMG00010.1 araport11 ATMG00010 886559 protein_coding araport11 ATMG00010.1 araport11 ATMG00010 886561 rRNA araport11 ATMG00020.1 araport11 ATMG00020 # summarise features in GTF file tableFeatures(ens) CDS exon five_prime_utr start_codon stop_codon three_prime_utr 285977 313952 56384 48315 48313 48308 transcript 54013 # create table of genes my_gene <- getGenePositions(ens) dim(my_gene) [1] 32833 26 # gene IDs are unique length(my_gene$gene_id) [1] 32833 length(unique(my_gene$gene_id)) [1] 32833 library(dplyr) # use dplyr to create more summaries # number of genes on each seqname my_gene %>% group_by(seqid) %>% summarise(n()) # A tibble: 7 x 2 seqid `n()` <chr> <int> 1 1 8656 2 2 5174 3 3 6435 4 4 4921 5 5 7362 6 Mt 152 7 Pt 133 my_gene %>% filter(seqid == "Mt") %>% select(gene_id) %>% head() gene_id 1 ATMG00010 2 ATMG00020 3 ATMG00030 4 ATMG00040 5 ATMG00050 6 ATMG00060 my_gene %>% filter(seqid == 2) %>% select(gene_id) %>% head() gene_id 1 AT2G01008 2 AT2G03855 3 AT2G01010 4 AT2G01020 5 AT2G01021 6 AT2G01023
From the gene_id we can see that they follow a certain nomenclature. Let's check out the distribution of gene lengths.
# length of genes my_gene_length <- gt$end - gt$start my_density <- density(my_gene_length) plot(my_density, main = 'Distribution of gene lengths')
What are the gene lengths at the two peaks?
# what's the first peak? which.max(my_density$y) [1] 15 my_density$x[which.max(my_density$y)] [1] 285.3376 # find peaks in a sequence of numbers find_peak <- function (x){ which(diff(sign(diff(x))) < 0) + 1 } find_peak(my_density$y) [1] 15 43 213 236 252 262 279 300 312 331 340 366 374 436 487 503 my_density$x[find_peak(my_density$y)[1:2]] [1] 285.3376 1832.8927 plot(my_density, main = 'Distribution of gene lengths') abline(v = c(my_density$x[find_peak(my_density$y)[1]], my_density$x[find_peak(my_density$y)[2]]), col = 'red', lty = 2) text(x = my_density$x[find_peak(my_density$y)[1]] + 1500, y = my_density$y[find_peak(my_density$y)[1]], labels = round(my_density$x[find_peak(my_density$y)[1]], 2)) text(x = my_density$x[find_peak(my_density$y)[2]] + 1500, y = my_density$y[find_peak(my_density$y)[2]], labels = round(my_density$x[find_peak(my_density$y)[2]], 2))
Summarise the biotypes and create a plot.
table(my_gene$gene_biotype) atlncRNA atRNA lncRNA miRNA nontranslating_CDS 1037 78 2444 325 27 otherRNA protein_coding rRNA snoRNA snRNA 221 27628 15 287 82 tRNA 689 my_gene %>% group_by(seqid, gene_biotype) %>% summarise(count = n()) -> my_tally ggplot(my_tally, aes(x = seqid, y = log2(count))) + geom_bar(aes(fill = gene_biotype), stat = 'identity', position = 'dodge')
The majority of features are protein-coding genes as in other genomes.
It's easy to create a GRanges using the genes data frame.
library(GenomicRanges) my_gr <- with(my_gene, GRanges(seqid, IRanges(start, end), strand, id = gene_id)) library(Gviz) ref <- GRanges() ref_track <- GenomeAxisTrack() options(ucscChromosomeNames=FALSE) data_track <- AnnotationTrack(my_gr, name = "Genes", showFeatureId = TRUE) plotTracks(c(ref_track, data_track), from = 1, to = 10000)
Using rtracklayer
As pointed out in a comment, one can use rtracklayer to import a GTF file.
# install package if you haven't already if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("rtracklayer") library('rtracklayer') my_file <- "Arabidopsis_thaliana.TAIR10.36.gtf" my_obj <- import(my_file) class(my_obj) [1] "GRanges" attr(,"package") [1] "GenomicRanges" my_obj GRanges object with 888095 ranges and 15 metadata columns: seqnames ranges strand | source type score phase gene_id <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer> <character> [1] 1 [3631, 5899] + | araport11 gene <NA> <NA> AT1G01010 [2] 1 [3631, 5899] + | araport11 transcript <NA> <NA> AT1G01010 [3] 1 [3631, 3913] + | araport11 exon <NA> <NA> AT1G01010 [4] 1 [3760, 3913] + | araport11 CDS <NA> 0 AT1G01010 [5] 1 [3760, 3762] + | araport11 start_codon <NA> 0 AT1G01010 ... ... ... ... . ... ... ... ... ... [888091] Pt [152806, 153195] + | araport11 CDS <NA> 0 ATCG01310 [888092] Pt [152806, 152808] + | araport11 start_codon <NA> 0 ATCG01310 [888093] Pt [153878, 154312] + | araport11 exon <NA> <NA> ATCG01310 [888094] Pt [153878, 154309] + | araport11 CDS <NA> 0 ATCG01310 [888095] Pt [154310, 154312] + | araport11 stop_codon <NA> 0 ATCG01310 gene_name gene_source gene_biotype transcript_id transcript_source transcript_biotype <character> <character> <character> <character> <character> <character> [1] NAC001 araport11 protein_coding <NA> <NA> <NA> [2] NAC001 araport11 protein_coding AT1G01010.1 araport11 protein_coding [3] NAC001 araport11 protein_coding AT1G01010.1 araport11 protein_coding [4] NAC001 araport11 protein_coding AT1G01010.1 araport11 protein_coding [5] NAC001 araport11 protein_coding AT1G01010.1 araport11 protein_coding ... ... ... ... ... ... ... [888091] rpl2-B araport11 protein_coding ATCG01310.1 araport11 protein_coding [888092] rpl2-B araport11 protein_coding ATCG01310.1 araport11 protein_coding [888093] rpl2-B araport11 protein_coding ATCG01310.1 araport11 protein_coding [888094] rpl2-B araport11 protein_coding ATCG01310.1 araport11 protein_coding [888095] rpl2-B araport11 protein_coding ATCG01310.1 araport11 protein_coding exon_number exon_id protein_id protein_version <character> <character> <character> <character> [1] <NA> <NA> <NA> <NA> [2] <NA> <NA> <NA> <NA> [3] 1 AT1G01010.1.exon1 <NA> <NA> [4] 1 <NA> AT1G01010.1 1 [5] 1 <NA> <NA> <NA> ... ... ... ... ... [888091] 1 <NA> ATCG01310.1 1 [888092] 1 <NA> <NA> <NA> [888093] 2 ATCG01310.1.exon2 <NA> <NA> [888094] 2 <NA> ATCG01310.1 1 [888095] 2 <NA> <NA> <NA> ------- seqinfo: 7 sequences from an unspecified genome; no seqlengths
Summary
The refGenome() package makes it easy to load GTF files into R. The package also contains functions for creating summaries. One can also use the dplyr package to create your own summaries. From this, I've learned some properties of Arabidopsis thaliana genes.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
I am not sure that, why don’t just use rtracklayer packages to import gtf files?
Thanks for the comment; I’ve updated the post.
I wonder, how do you make these neat graphics with R codes and texts
appear in your blog? Rmarkdown?
I would like to make a similar blog like yours. Is it hard?
Thank you for the hints.
I’ve learnt a lot with your blog already. 🙂
The back-end for this blog is WordPress but I don’t really recommend it because it’s a bit tedious. I work in RStudio and export all the images as PNG files and import them into WordPress. I highly recommend R Markdown, which is usually how I work.
As for blogging, I recommend using Jekyll and GitHub pages. This is a good guide to follow https://www.smashingmagazine.com/2014/08/build-blog-jekyll-github-pages/ and it’s not hard.
And I’m glad you have learned a lot from this blog! Motivates me to write more 🙂
Hi,
THanks for this useful blog, are there any packages similary to parse and summarize a gff3 annotation file.
Kapeel
Yes, use rtracklayer.
Hi,
I keep getting errors like Error in group_by(., seqid) : could not find function “group_by” or Error in my_gene %>% group_by(seqid) %>% summarise(n()) :
could not find function “%>%”
I have all packages necessary installed, where is this error coming from?
Thank you
You need the R package called dplyr.
Hi,
I have recently tried to install both refGenome and rtrackerlayer but both have been removed from CRAN. Are there any other alternatives?
Sorry, I should point out that rtracklayer is a Bioconductor package (see https://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html for installation instructions). I have updated the post.