Defining genomic regions

Updated 2014 June 24th to use GENCODE version 19

RNA sequencing (RNA-Seq) reads are typically mapped back to the genome (or transcriptome in some cases) after sequencing. The next task is to annotate the reads, to see which regions the reads mapped to. Typically one creates an annotation file and compares the coordinates of the mapped reads to the annotation file. Creating this annotation file is quite easy using BEDTools; in this post I refer to the creation of the annotation file as defining genomic regions, since in the end I will have several files that contain coordinates of exonic, intronic, and intergenic regions. I will define these regions with respect to GENCODE annotations.

First let's install the latest version of BEDTools:

git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make clean; make all
bin/bedtools --version
#bedtools v2.20.1-4-gb877b35
cd ..

Now download the GENCODE version 19 (which is currently the latest update):

v=19
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_$v/gencode.v$v.annotation.gtf.gz

#what does this file look like?
zcat gencode.v19.annotation.gtf.gz | head
##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-12-05
chr1    HAVANA  gene    11869   14412   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1;  exon_id "ENSE00002234944.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2;  exon_id "ENSE00003582793.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3;  exon_id "ENSE00002312635.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

In the third column of the GTF file corresponds to the feature type. How many different types of features are there?

zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f3 | sort | uniq -c | sort -k1rn
1196293 exon
 723784 CDS
 284573 UTR
 196520 transcript
  84144 start_codon
  76196 stop_codon
  57820 gene
    114 Selenocysteine

So the exonic regions are already defined; however, I would like to remove overlapping exons. The mergeBed utility in BEDTools can accomplish this; however before merging make sure the coordinates are sorted. We can use sortBed to sort the BED file:

v=19
zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/mergeBed -i - | gzip > gencode_v${v}_exon_merged.bed.gz

To define intronic regions, we just need to subtract the exonic region from the genic region. The utility subtractBed can do this:

v=19
zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/subtractBed -a stdin -b gencode_v${v}_exon_merged.bed.gz |
gzip > gencode_v${v}_intron.bed.gz

#let's intersect the two files
#this shouldn't produce any output
intersectBed -a gencode_v${v}_exon_merged.bed.gz -b gencode_v${v}_intron.bed.gz

And finally to define intergenic regions, we use complementBed to find regions not covered by genes.

v=19
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
        "select chrom, size from hg19.chromInfo"  > hg19.genome

zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/complementBed -i stdin -g hg19.genome |
gzip > gencode_v${v}_intergenic.bed.gz

And that's it.

bedtoolsHere's a schematic showing the steps we performed above.

Genome coverage

How much of the genome does exons, introns, and intergenic regions cover?

#!/bin/env perl

use strict;
use warnings;

my $v = 19;

my $exon_file       = "gencode_v${v}_exon_merged.bed.gz";
my $intergenic_file = "gencode_v${v}_intergenic.bed.gz";
my $intron_file     = "gencode_v${v}_intron.bed.gz";

my $exon_coverage       = coverage($exon_file);
my $intergenic_coverage = coverage($intergenic_file);
my $intron_coverage     = coverage($intron_file);

my $total = $exon_coverage + $intergenic_coverage + $intron_coverage;

printf "Exon: %.2f\n", $exon_coverage*100/$total;
printf "Intron: %.2f\n", $intron_coverage*100/$total;
printf "Intergenic: %.2f\n", $intergenic_coverage*100/$total;

sub coverage {
   my ($infile) = @_;
   my $coverage = 0;
   open(IN,'-|',"zcat $infile") || die "Could not open $infile: $!\n";
   while(<IN>){
      chomp;
      my ($chr, $start, $end) = split(/\t/);
      my $c = $end - $start;
      $coverage += $c;
   }
   close(IN);
   return($coverage);
}

exit(0);

Running this script:

coverage.pl
Exon: 3.72
Intron: 48.25
Intergenic: 48.03

I had initially thought that the intergenic region would have the highest percent coverage in the genome compared to introns and exons.

The untranslated region (UTR)

I got a question (see comments) regarding the untranslated region (UTR). To illustrate my answer, I'll use the transcript ENST00000335295 as an example:

v=19
zcat gencode.v$v.annotation.gtf.gz | grep ENST00000335295
chr11   HAVANA  transcript      5246694 5248301 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  exon    5248160 5248301 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1;  exon_id "ENSE00001829867.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  CDS     5248160 5248251 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1;  exon_id "ENSE00001829867.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  start_codon     5248249 5248251 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1;  exon_id "ENSE00001829867.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  exon    5247807 5248029 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 2;  exon_id "ENSE00001057381.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  CDS     5247807 5248029 .       -       1       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 2;  exon_id "ENSE00001057381.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  exon    5246694 5246956 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3;  exon_id "ENSE00001600613.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  CDS     5246831 5246956 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3;  exon_id "ENSE00001600613.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  stop_codon      5246828 5246830 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3;  exon_id "ENSE00001600613.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  UTR     5248252 5248301 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  UTR     5246694 5246830 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";

This transcript (HBB) starts at 5248301 and ends at 5246694 on chromosome 11 (the coordinates are reversed because it's on the negative strand). We see two lines with the UTR feature for this transcript, one starting at 5248301 and the other at 5246830.

So the coordinates for the 5' UTR are chr11:5248252-5248301 and the 3' UTR are chr11:5246694-5246830. Perhaps I should have picked a transcript that's on the positive strand, however this was one example I knew where the transcript is short and well defined (contains the start and end codons, and the 5' and 3' UTRs). Which now leads me to wonder, how many transcripts have defined UTRs?

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.annotation.gtf.gz>\n";
my $infile = shift or die $usage;

if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

my %transcript = ();
my $current_transcript = '';

while(<IN>){
   chomp;
   next if (/^#/);
   #chr11   HAVANA  transcript      65265233        65273940        .       +       .       gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);

   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';

   if ($type eq 'transcript'){
      foreach my $blah (@annotation){
         my ($type,$name) = split(/\s+/,$blah);
         if ($type eq 'transcript_id'){
            $current_transcript = $name;
            $current_transcript =~ s/"//g;
            $transcript{$current_transcript} = 0;
         }
      }
      if ($current_transcript eq 'none'){
         die "No name for entry $.\n";
      }
   }

   if ($type eq 'UTR'){
      $transcript{$current_transcript}++;
   }
}
close(IN);

foreach my $transcript (keys %transcript){
   print "$transcript\t$transcript{$transcript}\n";
}

exit(0);

Now to run the script:

check_utr.pl gencode.v19.annotation.gtf.gz | gzip > transcript_utr_number.out.gz

zcat transcript_utr_number.out | wc -l
196520

zcat transcript_utr_number.out | cut -f2 | sort | uniq -c | sort -k1rn | head
 103800 0
  34999 2
  24234 3
  11543 1
  10352 4
   4493 5
   2264 6
   1297 7
    796 8
    583 9

zcat transcript_utr_number.out | awk '$2==57 {print}'
ENST00000264319.7       57

Running the script above, I find that over half of the GENCODE transcripts (103,800/196,520) don't have a defined UTRs. 34,999 transcripts have 2 defined UTRs, 24,234 have 3 because one of the UTR is spliced. The transcript ENST00000264319.7 had 57 UTR lines, which was the most UTR annotations for a transcript.

Defining the 5' and 3' UTRs

I'm not entirely sure why the UTRs weren't separated into 5' and 3', especially when GENCODE went through the trouble of labelling the start_codon and the stop_codon. I'm going to assume that if a UTR is closer to the starting position of the transcript it's the 5' UTR and vice versa.

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.annotation.gtf>\n";
my $infile = shift or die $usage;

if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

my %transcript = ();
my $current_transcript = '';
my $transcript_start = 0;
my $transcript_end = 0;

while(<IN>){
   chomp;
   next if (/^#/);
   #chr11   HAVANA  transcript      65265233        65273940        .       +       .       gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);

   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';

   if ($type eq 'transcript'){
      foreach my $blah (@annotation){
         my ($type,$name) = split(/\s+/,$blah);
         if ($type eq 'transcript_id'){
            $current_transcript = $name;
            $current_transcript =~ s/"//g;
            $transcript_start = $start;
            $transcript_end = $end;
         }
      }
      if ($current_transcript eq 'none'){
         die "No name for entry $.\n";
      }
   }

   if ($type eq 'UTR'){
      my $region = '';
      if ($strand eq '+'){
         my $dis_to_start = abs($start - $transcript_start);
         my $dis_to_end = abs($start - $transcript_end);
         $region = $dis_to_start < $dis_to_end ? '5_UTR' : '3_UTR';
      } else {
         my $dis_to_start = abs($end - $transcript_end);
         my $dis_to_end = abs($end - $transcript_start);
         $region = $dis_to_start < $dis_to_end ? '5_UTR' : '3_UTR';
      }
      print join ("\t", $chr, $start, $end, $region, $current_transcript, $strand),"\n";
   }
}
close(IN);

exit(0);

To run the script:

print_utr.pl gencode.v19.annotation.gtf.gz | gzip > transcript_utr.bed.gz
zcat transcript_utr.bed.gz | head
chr1    70006   70008   3_UTR   ENST00000335137.3       +
chr1    137621  138532  5_UTR   ENST00000423372.3       -
chr1    139310  139379  5_UTR   ENST00000423372.3       -
chr1    134901  135802  3_UTR   ENST00000423372.3       -
chr1    367640  367658  5_UTR   ENST00000426406.1       +
chr1    368595  368634  3_UTR   ENST00000426406.1       +
chr1    621059  621098  3_UTR   ENST00000332831.2       -
chr1    622035  622053  5_UTR   ENST00000332831.2       -
chr1    819981  819983  3_UTR   ENST00000594233.1       +
chr1    860260  860328  5_UTR   ENST00000420190.1       +

Basically the script gets the coordinates of the transcript, and whenever there is an UTR annotation, it checks to see if the UTR coordinates are closer to the start or the end of the transcript.

Promoter region

Here's a script that reads the GENCODE annotation file, obtains the starting positions of each transcript, adds some user defined padding around this starting position and outputs it as a bed file. The starting position plus some padding is defined as the promoter region. Note that this script assumes that the hg19.genome file is available in the same directory as the script.

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.annotation.gtf> <padding>\n";
my $infile = shift or die $usage;
my $span = shift or die $usage;

if ($span !~ /^\d+$/){
   die "Please enter a numeric value for the padding\n";
}

my $hg19 = 'hg19.genome';
my %hg19 = ();

open(IN,'<',$hg19) || die "Could not open $hg19: $!\n";
while(<IN>){
   chomp;
   #chr9_gl000201_random    36148
   my ($chr, $end) = split(/\t/);
   $hg19{$chr} = $end;
}
close(IN);

if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

while(<IN>){
   chomp;
   next if (/^#/);
   #chr11   HAVANA  transcript      65265233        65273940        .       +       .       gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
   next unless $type eq 'transcript';
   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';
   foreach my $blah (@annotation){
      my ($type,$name) = split(/\s+/,$blah);
      if ($type eq 'transcript_id'){
         $transcript_id = $name;
         $transcript_id =~ s/"//g;
      }
   }
   if ($transcript_id eq 'none'){
      die "No name for entry $.\n";
   }
   my $promoter_start = '';
   my $promoter_end = '';
   if ($strand eq '+'){
      $promoter_start = $start - $span;
      $promoter_end = $start + $span;
   } else {
      $promoter_start = $end - $span;
      $promoter_end = $end + $span;
   }
   if ($promoter_start < 0){
      warn "Adjusted promoter start to 0\n";
      $promoter_start = 0;
   } elsif ($promoter_end > $hg19{$chr}){
      warn "Adjusted promoter end to $hg19{$chr}\n";
      $promoter_end = $hg19{$chr};
   }
   print join("\t",$chr,$promoter_start,$promoter_end,$transcript_id,0,$strand),"\n";
}
close(IN);

exit(0);

Running the script:

promoter.pl gencode.v19.annotation.gtf.gz 200  | head
chr1    11669   12069   ENST00000456328.2       0       +
chr1    11672   12072   ENST00000515242.2       0       +
chr1    11674   12074   ENST00000518655.2       0       +
chr1    11810   12210   ENST00000450305.2       0       +
chr1    29170   29570   ENST00000438504.2       0       -
chr1    24686   25086   ENST00000541675.1       0       -
chr1    29170   29570   ENST00000423562.1       0       -
chr1    29370   29770   ENST00000488147.1       0       -
chr1    29606   30006   ENST00000538476.1       0       -
chr1    29354   29754   ENST00000473358.1       0       +

See also

See the BEDTools documentation for cool usage examples.

All code used in this post is available at https://github.com/davetang/defining_genomic_regions.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
29 comments Add yours
  1. Hi Dave,

    Great resources and tutorials to extract the genomic regions from the GENCODE file.

    Here, I want to look into the 3’UTR and 5’UTR regions in the gencode gtf file
    altough the gtf file has a thrid column containg UTR info, it does not seperate the 3’UTR and 5’UTR.

    Moreover, in UCSC browser, it offers us the 3’UTR&5’UTR bed file for V14 gencode.
    So my question is how can we find this info from the gencode file from http://www.gencodegenes.org/releases/17.html, i.e., V17, the latest version gtf file?

    Thanks

    1. Hi Asaki,

      Thank you for the compliment. As for the UTRs, since we know the orientation of the gene/transcript, from the gtf file, we know whether the UTR is at the 5′ or 3′ end.

      Bioconductor contains GENCODE genes, but it is not the latest version (since it’s derived from the UCSC Genome Browser).

      It wouldn’t be too difficult to write a Perl script that extracts UTRs based on the gene orientation but some people are against writing custom scripts. I’m fine with custom scripts as long as they work as expected and are shared.

      Cheers,

      Dave

  2. How about doing the same thing on a plant genome or for other non model organism for which something like gencode is not available.

        1. Hi Saad,

          I downloaded the file and had a look; it has annotations for:

          CDS
          five_prime_UTR
          gene
          mRNA
          three_prime_UTR

          It seems the UTRs are defined already?

          Cheers,

          Dave

          1. Hi Dave,

            I was more interested to generate a file similar to what UCSC generates. Unfortunately my organism is not there in UCSC browser.

            This is the kind of output I am looking for. Do you think if there are already tools to do that or do I have to write a script for it myself.

            output by ucsc : –

            chr1 14969 15038 NR_024540_utr3_1_0_chr1_14970_r 0 – NR_024540 WASH7P 653635 pseudogene

  3. Thank you very much for the above executed scripts, i was able to cross check my ouput with that of bedtools. I have no transcript info. All i have is exon and cds.The difference lowest start point of cds and lowest start point of exon will be my 5’UTR and difference in highest will be my 3’UTR.
    The following is my GFF file can you please help me extract the UTR regions?
    Ca1 Gnomon gene 254 741 . – . ID=gene0;Name=LOC101497325;Dbxref=GeneID:101497325;gbkey=Gene;gene=LOC101497325
    Ca1 Gnomon mRNA 254 741 . – . ID=rna0;Name=XM_004486173.1;Parent=gene0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
    Ca1 Gnomon exon 602 741 . – . ID=id1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
    Ca1 Gnomon exon 254 506 . – . ID=id2;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
    Ca1 Gnomon CDS 602 663 . – 0 ID=cds0;Name=XP_004486230.1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XP_004486230.1;gbkey=CDS;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;protein_id=XP_004486230.1
    Ca1 Gnomon CDS 254 506 . – 1 ID=cds0;Name=XP_004486230.1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XP_004486230.1;gbkey=CDS;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;protein_id=XP_004486230.1
    Ca1 Gnomon gene 18089 20552 . – . ID=gene1;Name=LOC101488339;Dbxref=GeneID:101488339;gbkey=Gene;gene=LOC101488339

    Ca1 Gnomon gene 165984 168949 . + . ID=gene10;Name=LOC101493009;Dbxref=GeneID:101493009;gbkey=Gene;gene=LOC101493009
    Ca1 Gnomon mRNA 165984 168949 . + . ID=rna18;Name=XM_004485362.1;Parent=gene10;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
    Ca1 Gnomon exon 165984 166302 . + . ID=id169;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
    Ca1 Gnomon exon 167289 168949 . + . ID=id170;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
    Ca1 Gnomon CDS 167476 168681 . + 0 ID=cds17;Name=XP_004485419.1;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XP_004485419.1;gbkey=CDS;product=uncharacterized protein LOC101493009;protein_id=XP_004485419.1

    1. Hi Alok,

      if I understood you correctly, then the 5′ UTR for LOC101497325 is Ca1:663-741 and the 3′ UTR is Ca1:254-506 and for LOC101493009 the 5′ UTR is Ca1:165984-166302 and the 3′ UTR is Ca1:168681-168949.

      Cheers,

      Dave

  4. Thanks Dave for this very nice post.
    I found a small in the conversion of GTF format to BED format. GTF has 1-based start coordinates, and BED has 0-based start coordinates. Thus the awk command to convert should for example be:

    zcat gencode.v$v.annotation.gtf.gz |
    awk ‘BEGIN{OFS=”\t”;} $3==”exon” {print $1,$4-1,$5}’
    etc.

    Thanks
    Julien

    1. Thanks Julien! This off-by-one error caused by 0- and 1-based coordinates is probably the most common bioinformatic error 🙂

  5. Hi Dave,

    nice piece of work, thanks. Wondering about your thoughts on how to avoid issues when merging the exons due to overlapping genes on each strand. How do you then know which gene/transcript is aligned to?

    Thanks,

    Bruce.

    1. Hi Bruce,

      thanks for the comment. This post was definitely simplifying the task of annotating mapped reads, which as you point out can get quite complicated (and messy).

      One approach a colleague took, was to create a table for mapped reads (or clusters of reads), where each column was a genomic property, e.g. antisense to promoter, sense to exon, etc., and each column would be filled with the transcript ID that fulfilled that property. At the end, he could be able to tell the properties of a read. Needless to say he didn’t merge exons, like I did, or else he would lose information.

      Cheers,

      Dave

  6. Thanks a lot, Dave.
    It’s a wonderful post. While when I download gencode.v20.annotation.gtf.gz and so forth, it gives the error messages as below,
    ====
    Checking and Creating Intergenic Regions
    Warning: end of BED entry exceeds chromosome length. Please correct.
    ====
    I also check the entry count for each output.
    ====
    152320 gencode.v19.annotation.gtf.gz
    8543 gencode_v19_exon_merged.bed.gz
    575 gencode_v19_intergenic.bed.gz
    6805 gencode_v19_intron.bed.gz
    8060 promoter.bed.gz
    8243 transcript_utr.bed.gz
    3292 transcript_utr_number.out.gz
    ====
    158000 gencode.v23.annotation.gtf.gz
    8116 gencode_v23_exon_merged.bed.gz
    187 gencode_v23_intergenic.bed.gz
    6998 gencode_v23_intron.bed.gz
    8731 promoter.bed.gz
    8234 transcript_utr.bed.gz
    3068 transcript_utr_number.out.gz
    ====

    The gencode_v23_intergenic.bed.gz is much smaller than gencode_v19_intergenic.bed.gz, is it correct output?

    Thanks a lot for your comment.

    1. Hi Ben,

      GENCODE version 19 is the last version of GENCODE that uses the hg19 reference genome; that’s why you’re getting the warnings (a good thing!).

      In the MySQL command, change it to: “select chrom, size from hg38.chromInfo” and save the output as hg38.genome and change the workflow accordingly.

      Cheers,

      Dave

  7. Hi Dave,
    Very nice post. All goes well except when I try to isolate intergenic regions. I get an error: Less than the req’d two fields were encountered in the genome file. May I please request you to help me!

    1. Can you confirm that your genome file looks something like this?


      cat hg19.genome | head
      chrom size
      chr1 249250621
      chr2 243199373
      chr3 198022430
      chr4 191154276
      chr5 180915260
      chr6 171115067
      chr7 159138663
      chrX 155270560
      chr8 146364022

      1. Hi Dave,
        Thanks so much. It works very well now!
        I was put Ca1 instead refseq Sequence NC_021160.1.
        Chromosome Size
        NC_021160.1 48359943
        NC_021161.1 36634854
        NC_021162.1 39989001
        NC_021163.1 49191682
        NC_021164.1 48169137
        NC_021165.1 59463898
        NC_021166.1 48961560
        NC_021167.1 16477302

      1. Hi Dave,
        Really good post…..I have created intron.bed file but for intergenic region I am getting error:
        Error: The genome file gen.size has no valid entries. Exiting.
        I have created this file from (ftp://ftp.solgenomics.net/tomato_genome/assembly/current_build/stats_chromosomes.pdf) pdf and single space delimiter used.
        chrom size
        chr0 21805821
        chr1 90304244
        chr2 49918294
        chr3 64840714
        chr4 64064312
        chr5 65021438
        chr6 46041636
        chr7 65268621
        chr8 63032657
        chr9 67662091
        chr10 64834305
        chr11 53386025
        chr12 65486253
        even I tried tab as delimiter but it gave this error:
        ***** ERROR: illegal character ‘ ‘ found in integer conversion of string ” 21805821″. Exiting…
        chrom size
        chr0 21805821
        chr1 90304244
        chr2 49918294
        chr3 64840714
        chr4 64064312
        chr5 65021438
        chr6 46041636
        chr7 65268621
        chr8 63032657
        chr9 67662091
        chr10 64834305
        chr11 53386025
        chr12 65486253

        1. Copy text from PDFs can sometimes introduce unsupported characters, which may be what the “illegal character” warning is about. Try to copy the coordinates from a plain text file instead.

  8. Hi Dave,

    Thanks for your great post.

    Just to check – does the intergenic regions created here include or exclude introns?

    Cheers

  9. Hi Dave,

    Can you comment on how to select the “padding” for promoter assignment? What would be a reasonable padding for hg38? Are there any good criteria to choose it?

    Thanks,
    Siyu

  10. Hi Dave,

    Thanks for the prompt reply and additional valuable resources! This is super helpful~

    Happy holidays~

    Best,
    Siyu

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.