Read GTF file into R

The Gene Transfer Format (GTF) is a refinement of the General Feature Format (GFF). A GFF file has nine columns:

seqname The name of the sequence; must be a chromosome or scaffold.
source The program that generated this feature.
feature The name of this type of feature, e.g. "CDS", "start_codon", "stop_codon", and "exon"
start The starting position of the feature in the sequence; the first base is numbered 1.
end The ending position of the feature (inclusive).
score A score between 0 and 1000.
strand Valid entries include "+", "-", or ".".
frame If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be ".".
group All lines with the same group are linked together into a single item.

The first eight fields in a GTF file are the same as GFF but the group field has been expanded into a list of attributes, where each attribute consists of a type/value pair. Attributes must end in a semi-colon and be separated from any following attribute by exactly one space. The attribute list must begin with the two mandatory attributes:

  1. gene_id value - A globally unique identifier for the genomic source of the sequence.
  2. transcript_id value - A globally unique identifier for the predicted transcript.

To get started, download an example GTF file. I've started working with Arabidopsis thaliana, so I'll use a GTF file corresponding to Arabidopsis thaliana genes.

wget -c ftp://ftp.ensemblgenomes.org/pub/release-36/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.36.gtf.gz
gunzip Arabidopsis_thaliana.TAIR10.36.gtf.gz

head Arabidopsis_thaliana.TAIR10.36.gtf 
#!genome-build TAIR10
#!genome-version TAIR10
#!genome-date 2010-09
#!genome-build-accession GCA_000001735.1
#!genebuild-last-updated 2010-09
1	araport11	gene	3631	5899	.	+	.	gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1	araport11	transcript	3631	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1	araport11	exon	3631	3913	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";
1	araport11	CDS	3760	3913	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; protein_version "1";
1	araport11	start_codon	3760	3762	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

cat Arabidopsis_thaliana.TAIR10.36.gtf | grep -v "^#" | wc -l
  888095

The refGenome package provides functions for reading and summarising GTF files.

install.packages("refGenome")
library(refGenome)

# change to directory where you downloaded GTF file
setwd("~/muse/refgenome")

# create ensemblGenome object for storing Ensembl genomic annotation data
ens <- ensemblGenome()

# read GTF file into ensemblGenome object
read.gtf(ens, "Arabidopsis_thaliana.TAIR10.36.gtf")
[read.gtf.refGenome] Reading file 'Arabidopsis_thaliana.TAIR10.36.gtf'.
[GTF]   888100 lines processed.
[read.gtf.refGenome] Extracting genes table.
[read.gtf.refGenome] Found 32,833 gene records.
[read.gtf.refGenome] Finished.

class(ens)
[1] "ensemblGenome"
attr(,"package")
[1] "refGenome"

We'll use some functions provided by the refGenome package to create summaries.

# counts all annotations on each seqname
tableSeqids(ens)

     1      2      3      4      5     Mt     Pt 
229818 132448 165053 131565 195121    690    567

# returns all annotations on mitochondria
extractSeqids(ens, 'Mt')
Object of class 'ensemblGenome' with 690 rows and 27 columns.
           id seqid    source     feature start   end score strand frame   2"   4"   5"   3"   8"  H3"
886555 886555    Mt araport11  transcript   273   734     .      -     . <NA> <NA> <NA> <NA> <NA> <NA>
886556 886556    Mt araport11        exon   273   734     .      -     . <NA> <NA> <NA> <NA> <NA> <NA>
886557 886557    Mt araport11         CDS   276   734     .      -     0 <NA> <NA> <NA> <NA> <NA> <NA>
886558 886558    Mt araport11 start_codon   732   734     .      -     0 <NA> <NA> <NA> <NA> <NA> <NA>
886559 886559    Mt araport11  stop_codon   273   275     .      -     0 <NA> <NA> <NA> <NA> <NA> <NA>
886561 886561    Mt araport11  transcript  8848 11415     .      -     . <NA> <NA> <NA> <NA> <NA> <NA>
       protein_version  protein_id           exon_id gene_name   gene_biotype exon_number   1"
886555            <NA>        <NA>              <NA>   ORF153A protein_coding        <NA> <NA>
886556            <NA>        <NA> ATMG00010.1.exon1   ORF153A protein_coding           1 <NA>
886557               1 ATMG00010.1              <NA>   ORF153A protein_coding           1 <NA>
886558            <NA>        <NA>              <NA>   ORF153A protein_coding           1 <NA>
886559            <NA>        <NA>              <NA>   ORF153A protein_coding           1 <NA>
886561            <NA>        <NA>              <NA>     RRN26           rRNA        <NA> <NA>
       transcript_biotype transcript_source transcript_id gene_source   gene_id
886555     protein_coding         araport11   ATMG00010.1   araport11 ATMG00010
886556     protein_coding         araport11   ATMG00010.1   araport11 ATMG00010
886557     protein_coding         araport11   ATMG00010.1   araport11 ATMG00010
886558     protein_coding         araport11   ATMG00010.1   araport11 ATMG00010
886559     protein_coding         araport11   ATMG00010.1   araport11 ATMG00010
886561               rRNA         araport11   ATMG00020.1   araport11 ATMG00020

# summarise features in GTF file
tableFeatures(ens)

            CDS            exon  five_prime_utr     start_codon      stop_codon three_prime_utr 
         285977          313952           56384           48315           48313           48308 
     transcript 
          54013

# create table of genes
my_gene <- getGenePositions(ens)
dim(my_gene)
[1] 32833    26

# gene IDs are unique
length(my_gene$gene_id)
[1] 32833
length(unique(my_gene$gene_id))
[1] 32833

library(dplyr)
# use dplyr to create more summaries
# number of genes on each seqname
my_gene %>% group_by(seqid) %>% summarise(n())
# A tibble: 7 x 2
  seqid `n()`
  <chr> <int>
1     1  8656
2     2  5174
3     3  6435
4     4  4921
5     5  7362
6    Mt   152
7    Pt   133

my_gene %>% filter(seqid == "Mt") %>% select(gene_id) %>% head()
    gene_id
1 ATMG00010
2 ATMG00020
3 ATMG00030
4 ATMG00040
5 ATMG00050
6 ATMG00060

my_gene %>% filter(seqid == 2) %>% select(gene_id) %>% head()
    gene_id
1 AT2G01008
2 AT2G03855
3 AT2G01010
4 AT2G01020
5 AT2G01021
6 AT2G01023

From the gene_id we can see that they follow a certain nomenclature. Let's check out the distribution of gene lengths.

# length of genes
my_gene_length <- gt$end - gt$start
my_density <- density(my_gene_length)
plot(my_density, main = 'Distribution of gene lengths')

What are the gene lengths at the two peaks?

# what's the first peak?
which.max(my_density$y)
[1] 15
my_density$x[which.max(my_density$y)]
[1] 285.3376

# find peaks in a sequence of numbers
find_peak <- function (x){
  which(diff(sign(diff(x))) < 0) + 1
}
find_peak(my_density$y)
 [1]  15  43 213 236 252 262 279 300 312 331 340 366 374 436 487 503

my_density$x[find_peak(my_density$y)[1:2]]
[1]  285.3376 1832.8927

plot(my_density, main = 'Distribution of gene lengths')
abline(v = c(my_density$x[find_peak(my_density$y)[1]],
             my_density$x[find_peak(my_density$y)[2]]),
       col = 'red', lty = 2)

text(x = my_density$x[find_peak(my_density$y)[1]] + 1500,
     y = my_density$y[find_peak(my_density$y)[1]],
     labels = round(my_density$x[find_peak(my_density$y)[1]], 2))

text(x = my_density$x[find_peak(my_density$y)[2]] + 1500,
     y = my_density$y[find_peak(my_density$y)[2]],
     labels = round(my_density$x[find_peak(my_density$y)[2]], 2))

Summarise the biotypes and create a plot.

table(my_gene$gene_biotype)

          atlncRNA              atRNA             lncRNA              miRNA nontranslating_CDS 
              1037                 78               2444                325                 27 
          otherRNA     protein_coding               rRNA             snoRNA              snRNA 
               221              27628                 15                287                 82 
              tRNA 
               689

my_gene %>% group_by(seqid, gene_biotype) %>% summarise(count = n()) -> my_tally

ggplot(my_tally, aes(x = seqid, y = log2(count))) +
  geom_bar(aes(fill = gene_biotype), stat = 'identity', position = 'dodge')

The majority of features are protein-coding genes as in other genomes.

It's easy to create a GRanges using the genes data frame.

library(GenomicRanges)
my_gr <- with(my_gene, GRanges(seqid, IRanges(start, end), strand, id = gene_id))

library(Gviz)
ref <- GRanges()
ref_track <- GenomeAxisTrack()
options(ucscChromosomeNames=FALSE)
data_track <- AnnotationTrack(my_gr, name = "Genes", showFeatureId = TRUE)
plotTracks(c(ref_track, data_track),
           from = 1, to = 10000)

Using rtracklayer

As pointed out in a comment, one can use rtracklayer to import a GTF file.

# install package if you haven't already
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("rtracklayer")

library('rtracklayer')
my_file <- "Arabidopsis_thaliana.TAIR10.36.gtf"
my_obj <- import(my_file)

class(my_obj)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"

my_obj  
GRanges object with 888095 ranges and 15 metadata columns:
           seqnames           ranges strand |    source        type     score     phase     gene_id
              <Rle>        <IRanges>  <Rle> |  <factor>    <factor> <numeric> <integer> <character>
       [1]        1     [3631, 5899]      + | araport11        gene      <NA>      <NA>   AT1G01010
       [2]        1     [3631, 5899]      + | araport11  transcript      <NA>      <NA>   AT1G01010
       [3]        1     [3631, 3913]      + | araport11        exon      <NA>      <NA>   AT1G01010
       [4]        1     [3760, 3913]      + | araport11         CDS      <NA>         0   AT1G01010
       [5]        1     [3760, 3762]      + | araport11 start_codon      <NA>         0   AT1G01010
       ...      ...              ...    ... .       ...         ...       ...       ...         ...
  [888091]       Pt [152806, 153195]      + | araport11         CDS      <NA>         0   ATCG01310
  [888092]       Pt [152806, 152808]      + | araport11 start_codon      <NA>         0   ATCG01310
  [888093]       Pt [153878, 154312]      + | araport11        exon      <NA>      <NA>   ATCG01310
  [888094]       Pt [153878, 154309]      + | araport11         CDS      <NA>         0   ATCG01310
  [888095]       Pt [154310, 154312]      + | araport11  stop_codon      <NA>         0   ATCG01310
             gene_name gene_source   gene_biotype transcript_id transcript_source transcript_biotype
           <character> <character>    <character>   <character>       <character>        <character>
       [1]      NAC001   araport11 protein_coding          <NA>              <NA>               <NA>
       [2]      NAC001   araport11 protein_coding   AT1G01010.1         araport11     protein_coding
       [3]      NAC001   araport11 protein_coding   AT1G01010.1         araport11     protein_coding
       [4]      NAC001   araport11 protein_coding   AT1G01010.1         araport11     protein_coding
       [5]      NAC001   araport11 protein_coding   AT1G01010.1         araport11     protein_coding
       ...         ...         ...            ...           ...               ...                ...
  [888091]      rpl2-B   araport11 protein_coding   ATCG01310.1         araport11     protein_coding
  [888092]      rpl2-B   araport11 protein_coding   ATCG01310.1         araport11     protein_coding
  [888093]      rpl2-B   araport11 protein_coding   ATCG01310.1         araport11     protein_coding
  [888094]      rpl2-B   araport11 protein_coding   ATCG01310.1         araport11     protein_coding
  [888095]      rpl2-B   araport11 protein_coding   ATCG01310.1         araport11     protein_coding
           exon_number           exon_id  protein_id protein_version
           <character>       <character> <character>     <character>
       [1]        <NA>              <NA>        <NA>            <NA>
       [2]        <NA>              <NA>        <NA>            <NA>
       [3]           1 AT1G01010.1.exon1        <NA>            <NA>
       [4]           1              <NA> AT1G01010.1               1
       [5]           1              <NA>        <NA>            <NA>
       ...         ...               ...         ...             ...
  [888091]           1              <NA> ATCG01310.1               1
  [888092]           1              <NA>        <NA>            <NA>
  [888093]           2 ATCG01310.1.exon2        <NA>            <NA>
  [888094]           2              <NA> ATCG01310.1               1
  [888095]           2              <NA>        <NA>            <NA>
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths

Summary

The refGenome() package makes it easy to load GTF files into R. The package also contains functions for creating summaries. One can also use the dplyr package to create your own summaries. From this, I've learned some properties of Arabidopsis thaliana genes.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
10 comments Add yours
  1. I wonder, how do you make these neat graphics with R codes and texts
    appear in your blog? Rmarkdown?

    I would like to make a similar blog like yours. Is it hard?
    Thank you for the hints.
    I’ve learnt a lot with your blog already. 🙂

    1. The back-end for this blog is WordPress but I don’t really recommend it because it’s a bit tedious. I work in RStudio and export all the images as PNG files and import them into WordPress. I highly recommend R Markdown, which is usually how I work.

      As for blogging, I recommend using Jekyll and GitHub pages. This is a good guide to follow https://www.smashingmagazine.com/2014/08/build-blog-jekyll-github-pages/ and it’s not hard.

      And I’m glad you have learned a lot from this blog! Motivates me to write more 🙂

  2. Hi,

    THanks for this useful blog, are there any packages similary to parse and summarize a gff3 annotation file.

    Kapeel

  3. Hi,

    I keep getting errors like Error in group_by(., seqid) : could not find function “group_by” or Error in my_gene %>% group_by(seqid) %>% summarise(n()) :
    could not find function “%>%”

    I have all packages necessary installed, where is this error coming from?

    Thank you

  4. Hi,
    I have recently tried to install both refGenome and rtrackerlayer but both have been removed from CRAN. Are there any other alternatives?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.