Defining genomic regions

genomicsDavoJanuary 18, 201329

Updated 2014 June 24th to use GENCODE version 19

RNA sequencing (RNA-Seq) reads are typically mapped back to the genome (or transcriptome in some cases) after sequencing. The next task is to annotate the reads, to see which regions the reads mapped to. Typically one creates an annotation file and compares the coordinates of the mapped reads to the annotation file. Creating this annotation file is quite easy using BEDTools; in this post I refer to the creation of the annotation file as defining genomic regions, since in the end I will have several files that contain coordinates of exonic, intronic, and intergenic regions. I will define these regions with respect to GENCODE annotations.

First let's install the latest version of BEDTools:

git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make clean; make all
bin/bedtools --version
#bedtools v2.20.1-4-gb877b35
cd ..

Now download the GENCODE version 19 (which is currently the latest update):

v=19
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_$v/gencode.v$v.annotation.gtf.gz

#what does this file look like?
zcat gencode.v19.annotation.gtf.gz | head
##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-12-05
chr1    HAVANA  gene    11869   14412   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1;  exon_id "ENSE00002234944.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2;  exon_id "ENSE00003582793.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3;  exon_id "ENSE00002312635.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

In the third column of the GTF file corresponds to the feature type. How many different types of features are there?

zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f3 | sort | uniq -c | sort -k1rn
1196293 exon
 723784 CDS
 284573 UTR
 196520 transcript
  84144 start_codon
  76196 stop_codon
  57820 gene
    114 Selenocysteine

So the exonic regions are already defined; however, I would like to remove overlapping exons. The mergeBed utility in BEDTools can accomplish this; however before merging make sure the coordinates are sorted. We can use sortBed to sort the BED file:

v=19
zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/mergeBed -i - | gzip > gencode_v${v}_exon_merged.bed.gz

To define intronic regions, we just need to subtract the exonic region from the genic region. The utility subtractBed can do this:

v=19
zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/subtractBed -a stdin -b gencode_v${v}_exon_merged.bed.gz |
gzip > gencode_v${v}_intron.bed.gz

#let's intersect the two files
#this shouldn't produce any output
intersectBed -a gencode_v${v}_exon_merged.bed.gz -b gencode_v${v}_intron.bed.gz

And finally to define intergenic regions, we use complementBed to find regions not covered by genes.

v=19
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
        "select chrom, size from hg19.chromInfo"  > hg19.genome

zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/complementBed -i stdin -g hg19.genome |
gzip > gencode_v${v}_intergenic.bed.gz

And that's it.

Here's a schematic showing the steps we performed above.

Genome coverage

How much of the genome does exons, introns, and intergenic regions cover?

#!/bin/env perl

use strict;
use warnings;

my $v = 19;

my $exon_file       = "gencode_v${v}_exon_merged.bed.gz";
my $intergenic_file = "gencode_v${v}_intergenic.bed.gz";
my $intron_file     = "gencode_v${v}_intron.bed.gz";

my $exon_coverage       = coverage($exon_file);
my $intergenic_coverage = coverage($intergenic_file);
my $intron_coverage     = coverage($intron_file);

my $total = $exon_coverage + $intergenic_coverage + $intron_coverage;

printf "Exon: %.2f\n", $exon_coverage*100/$total;
printf "Intron: %.2f\n", $intron_coverage*100/$total;
printf "Intergenic: %.2f\n", $intergenic_coverage*100/$total;

sub coverage {
   my ($infile) = @_;
   my $coverage = 0;
   open(IN,'-|',"zcat $infile") || die "Could not open $infile: $!\n";
   while(<IN>){
      chomp;
      my ($chr, $start, $end) = split(/\t/);
      my $c = $end - $start;
      $coverage += $c;
   }
   close(IN);
   return($coverage);
}

exit(0);

Running this script:

coverage.pl
Exon: 3.72
Intron: 48.25
Intergenic: 48.03

I had initially thought that the intergenic region would have the highest percent coverage in the genome compared to introns and exons.

The untranslated region (UTR)

I got a question (see comments) regarding the untranslated region (UTR). To illustrate my answer, I'll use the transcript ENST00000335295 as an example:

v=19
zcat gencode.v$v.annotation.gtf.gz | grep ENST00000335295
chr11   HAVANA  transcript      5246694 5248301 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  exon    5248160 5248301 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1;  exon_id "ENSE00001829867.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  CDS     5248160 5248251 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1;  exon_id "ENSE00001829867.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  start_codon     5248249 5248251 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1;  exon_id "ENSE00001829867.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  exon    5247807 5248029 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 2;  exon_id "ENSE00001057381.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  CDS     5247807 5248029 .       -       1       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 2;  exon_id "ENSE00001057381.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  exon    5246694 5246956 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3;  exon_id "ENSE00001600613.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  CDS     5246831 5246956 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3;  exon_id "ENSE00001600613.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  stop_codon      5246828 5246830 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3;  exon_id "ENSE00001600613.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  UTR     5248252 5248301 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  UTR     5246694 5246830 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";

This transcript (HBB) starts at 5248301 and ends at 5246694 on chromosome 11 (the coordinates are reversed because it's on the negative strand). We see two lines with the UTR feature for this transcript, one starting at 5248301 and the other at 5246830.

So the coordinates for the 5' UTR are chr11:5248252-5248301 and the 3' UTR are chr11:5246694-5246830. Perhaps I should have picked a transcript that's on the positive strand, however this was one example I knew where the transcript is short and well defined (contains the start and end codons, and the 5' and 3' UTRs). Which now leads me to wonder, how many transcripts have defined UTRs?

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.annotation.gtf.gz>\n";
my $infile = shift or die $usage;

if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

my %transcript = ();
my $current_transcript = '';

while(<IN>){
   chomp;
   next if (/^#/);
   #chr11   HAVANA  transcript      65265233        65273940        .       +       .       gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);

   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';

   if ($type eq 'transcript'){
      foreach my $blah (@annotation){
         my ($type,$name) = split(/\s+/,$blah);
         if ($type eq 'transcript_id'){
            $current_transcript = $name;
            $current_transcript =~ s/"//g;
            $transcript{$current_transcript} = 0;
         }
      }
      if ($current_transcript eq 'none'){
         die "No name for entry $.\n";
      }
   }

   if ($type eq 'UTR'){
      $transcript{$current_transcript}++;
   }
}
close(IN);

foreach my $transcript (keys %transcript){
   print "$transcript\t$transcript{$transcript}\n";
}

exit(0);

Now to run the script:

check_utr.pl gencode.v19.annotation.gtf.gz | gzip > transcript_utr_number.out.gz

zcat transcript_utr_number.out | wc -l
196520

zcat transcript_utr_number.out | cut -f2 | sort | uniq -c | sort -k1rn | head
 103800 0
  34999 2
  24234 3
  11543 1
  10352 4
   4493 5
   2264 6
   1297 7
    796 8
    583 9

zcat transcript_utr_number.out | awk '$2==57 {print}'
ENST00000264319.7       57

Running the script above, I find that over half of the GENCODE transcripts (103,800/196,520) don't have a defined UTRs. 34,999 transcripts have 2 defined UTRs, 24,234 have 3 because one of the UTR is spliced. The transcript ENST00000264319.7 had 57 UTR lines, which was the most UTR annotations for a transcript.

Defining the 5' and 3' UTRs

I'm not entirely sure why the UTRs weren't separated into 5' and 3', especially when GENCODE went through the trouble of labelling the start_codon and the stop_codon. I'm going to assume that if a UTR is closer to the starting position of the transcript it's the 5' UTR and vice versa.

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.annotation.gtf>\n";
my $infile = shift or die $usage;

if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

my %transcript = ();
my $current_transcript = '';
my $transcript_start = 0;
my $transcript_end = 0;

while(<IN>){
   chomp;
   next if (/^#/);
   #chr11   HAVANA  transcript      65265233        65273940        .       +       .       gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);

   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';

   if ($type eq 'transcript'){
      foreach my $blah (@annotation){
         my ($type,$name) = split(/\s+/,$blah);
         if ($type eq 'transcript_id'){
            $current_transcript = $name;
            $current_transcript =~ s/"//g;
            $transcript_start = $start;
            $transcript_end = $end;
         }
      }
      if ($current_transcript eq 'none'){
         die "No name for entry $.\n";
      }
   }

   if ($type eq 'UTR'){
      my $region = '';
      if ($strand eq '+'){
         my $dis_to_start = abs($start - $transcript_start);
         my $dis_to_end = abs($start - $transcript_end);
         $region = $dis_to_start < $dis_to_end ? '5_UTR' : '3_UTR';
      } else {
         my $dis_to_start = abs($end - $transcript_end);
         my $dis_to_end = abs($end - $transcript_start);
         $region = $dis_to_start < $dis_to_end ? '5_UTR' : '3_UTR';
      }
      print join ("\t", $chr, $start, $end, $region, $current_transcript, $strand),"\n";
   }
}
close(IN);

exit(0);

To run the script:

print_utr.pl gencode.v19.annotation.gtf.gz | gzip > transcript_utr.bed.gz
zcat transcript_utr.bed.gz | head
chr1    70006   70008   3_UTR   ENST00000335137.3       +
chr1    137621  138532  5_UTR   ENST00000423372.3       -
chr1    139310  139379  5_UTR   ENST00000423372.3       -
chr1    134901  135802  3_UTR   ENST00000423372.3       -
chr1    367640  367658  5_UTR   ENST00000426406.1       +
chr1    368595  368634  3_UTR   ENST00000426406.1       +
chr1    621059  621098  3_UTR   ENST00000332831.2       -
chr1    622035  622053  5_UTR   ENST00000332831.2       -
chr1    819981  819983  3_UTR   ENST00000594233.1       +
chr1    860260  860328  5_UTR   ENST00000420190.1       +

Basically the script gets the coordinates of the transcript, and whenever there is an UTR annotation, it checks to see if the UTR coordinates are closer to the start or the end of the transcript.

Promoter region

Here's a script that reads the GENCODE annotation file, obtains the starting positions of each transcript, adds some user defined padding around this starting position and outputs it as a bed file. The starting position plus some padding is defined as the promoter region. Note that this script assumes that the hg19.genome file is available in the same directory as the script.

#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.annotation.gtf> <padding>\n";
my $infile = shift or die $usage;
my $span = shift or die $usage;

if ($span !~ /^\d+$/){
   die "Please enter a numeric value for the padding\n";
}

my $hg19 = 'hg19.genome';
my %hg19 = ();

open(IN,'<',$hg19) || die "Could not open $hg19: $!\n";
while(<IN>){
   chomp;
   #chr9_gl000201_random    36148
   my ($chr, $end) = split(/\t/);
   $hg19{$chr} = $end;
}
close(IN);

if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

while(<IN>){
   chomp;
   next if (/^#/);
   #chr11   HAVANA  transcript      65265233        65273940        .       +       .       gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
   next unless $type eq 'transcript';
   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';
   foreach my $blah (@annotation){
      my ($type,$name) = split(/\s+/,$blah);
      if ($type eq 'transcript_id'){
         $transcript_id = $name;
         $transcript_id =~ s/"//g;
      }
   }
   if ($transcript_id eq 'none'){
      die "No name for entry $.\n";
   }
   my $promoter_start = '';
   my $promoter_end = '';
   if ($strand eq '+'){
      $promoter_start = $start - $span;
      $promoter_end = $start + $span;
   } else {
      $promoter_start = $end - $span;
      $promoter_end = $end + $span;
   }
   if ($promoter_start < 0){
      warn "Adjusted promoter start to 0\n";
      $promoter_start = 0;
   } elsif ($promoter_end > $hg19{$chr}){
      warn "Adjusted promoter end to $hg19{$chr}\n";
      $promoter_end = $hg19{$chr};
   }
   print join("\t",$chr,$promoter_start,$promoter_end,$transcript_id,0,$strand),"\n";
}
close(IN);

exit(0);

Running the script:

promoter.pl gencode.v19.annotation.gtf.gz 200  | head
chr1    11669   12069   ENST00000456328.2       0       +
chr1    11672   12072   ENST00000515242.2       0       +
chr1    11674   12074   ENST00000518655.2       0       +
chr1    11810   12210   ENST00000450305.2       0       +
chr1    29170   29570   ENST00000438504.2       0       -
chr1    24686   25086   ENST00000541675.1       0       -
chr1    29170   29570   ENST00000423562.1       0       -
chr1    29370   29770   ENST00000488147.1       0       -
chr1    29606   30006   ENST00000538476.1       0       -
chr1    29354   29754   ENST00000473358.1       0       +

asaki says:

July 22, 2013 at 22:33

Hi Dave,

Great resources and tutorials to extract the genomic regions from the GENCODE file.

Here, I want to look into the 3’UTR and 5’UTR regions in the gencode gtf file
altough the gtf file has a thrid column containg UTR info, it does not seperate the 3’UTR and 5’UTR.

Moreover, in UCSC browser, it offers us the 3’UTR&5’UTR bed file for V14 gencode.
So my question is how can we find this info from the gencode file from http://www.gencodegenes.org/releases/17.html, i.e., V17, the latest version gtf file?

Thanks

Reply
1. Davo says:
  
  July 22, 2013 at 23:37
  
  Hi Asaki,
  
  Thank you for the compliment. As for the UTRs, since we know the orientation of the gene/transcript, from the gtf file, we know whether the UTR is at the 5′ or 3′ end.
  
  Bioconductor contains GENCODE genes, but it is not the latest version (since it’s derived from the UCSC Genome Browser).
  
  It wouldn’t be too difficult to write a Perl script that extracts UTRs based on the gene orientation but some people are against writing custom scripts. I’m fine with custom scripts as long as they work as expected and are shared.
  
  Cheers,
  
  Dave
  
  Reply
Pingback: Sequence conservation in vertebrates - Musings from a PhD candidate
Saad says:

February 2, 2014 at 22:26

How about doing the same thing on a plant genome or for other non model organism for which something like gencode is not available.

Reply
1. Davo says:
  
  February 2, 2014 at 23:12
  
  Yeah just use another set of transcript annotations and adapt the same concept.
  
  Reply
  1. Saad Khan says:
    
    February 14, 2014 at 00:04
    
    What if the organism does not have annotations of UTR and also how to get it in a format like gencode has. For example I have this file :-
    
    ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Gmax/annotation/Gmax_189_gene.gff3.gz
    
    what additional steps do I need to follow to get it into a format similar to gencode.
    
    Also gencode has lot of annotations that my file does not so your script does not seems to work with it.
    
    Reply
    1. Davo says:
      
      February 17, 2014 at 01:49
      
      Hi Saad,
      
      I downloaded the file and had a look; it has annotations for:
      
      CDS
      five_prime_UTR
      gene
      mRNA
      three_prime_UTR
      
      It seems the UTRs are defined already?
      
      Cheers,
      
      Dave
      
      Reply
      1. Saad Khan says:
        
        February 17, 2014 at 02:30
        
        Hi Dave,
        
        I was more interested to generate a file similar to what UCSC generates. Unfortunately my organism is not there in UCSC browser.
        
        This is the kind of output I am looking for. Do you think if there are already tools to do that or do I have to write a script for it myself.
        
        output by ucsc : –
        
        chr1 14969 15038 NR_024540_utr3_1_0_chr1_14970_r 0 – NR_024540 WASH7P 653635 pseudogene
        
        Reply
Alok says:

July 22, 2014 at 07:23

Thank you very much for the above executed scripts, i was able to cross check my ouput with that of bedtools. I have no transcript info. All i have is exon and cds.The difference lowest start point of cds and lowest start point of exon will be my 5’UTR and difference in highest will be my 3’UTR.
The following is my GFF file can you please help me extract the UTR regions?
Ca1 Gnomon gene 254 741 . – . ID=gene0;Name=LOC101497325;Dbxref=GeneID:101497325;gbkey=Gene;gene=LOC101497325
Ca1 Gnomon mRNA 254 741 . – . ID=rna0;Name=XM_004486173.1;Parent=gene0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon exon 602 741 . – . ID=id1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon exon 254 506 . – . ID=id2;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon CDS 602 663 . – 0 ID=cds0;Name=XP_004486230.1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XP_004486230.1;gbkey=CDS;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;protein_id=XP_004486230.1
Ca1 Gnomon CDS 254 506 . – 1 ID=cds0;Name=XP_004486230.1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XP_004486230.1;gbkey=CDS;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;protein_id=XP_004486230.1
Ca1 Gnomon gene 18089 20552 . – . ID=gene1;Name=LOC101488339;Dbxref=GeneID:101488339;gbkey=Gene;gene=LOC101488339

Ca1 Gnomon gene 165984 168949 . + . ID=gene10;Name=LOC101493009;Dbxref=GeneID:101493009;gbkey=Gene;gene=LOC101493009
Ca1 Gnomon mRNA 165984 168949 . + . ID=rna18;Name=XM_004485362.1;Parent=gene10;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon exon 165984 166302 . + . ID=id169;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon exon 167289 168949 . + . ID=id170;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon CDS 167476 168681 . + 0 ID=cds17;Name=XP_004485419.1;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XP_004485419.1;gbkey=CDS;product=uncharacterized protein LOC101493009;protein_id=XP_004485419.1

Reply
1. Davo says:
  
  July 22, 2014 at 08:01
  
  Hi Alok,
  
  if I understood you correctly, then the 5′ UTR for LOC101497325 is Ca1:663-741 and the 3′ UTR is Ca1:254-506 and for LOC101493009 the 5′ UTR is Ca1:165984-166302 and the 3′ UTR is Ca1:168681-168949.
  
  Cheers,
  
  Dave
  
  Reply
julien says:

September 16, 2014 at 08:37

Thanks Dave for this very nice post.
I found a small in the conversion of GTF format to BED format. GTF has 1-based start coordinates, and BED has 0-based start coordinates. Thus the awk command to convert should for example be:

zcat gencode.v$v.annotation.gtf.gz |
awk ‘BEGIN{OFS=”\t”;} $3==”exon” {print $1,$4-1,$5}’
etc.

Thanks
Julien

Reply
1. Davo says:
  
  September 16, 2014 at 08:40
  
  Thanks Julien! This off-by-one error caused by 0- and 1-based coordinates is probably the most common bioinformatic error 🙂
  
  Reply
Bruce says:

November 26, 2014 at 16:48

Hi Dave,

nice piece of work, thanks. Wondering about your thoughts on how to avoid issues when merging the exons due to overlapping genes on each strand. How do you then know which gene/transcript is aligned to?

Thanks,

Bruce.

Reply
1. Davo says:
  
  November 27, 2014 at 05:33
  
  Hi Bruce,
  
  thanks for the comment. This post was definitely simplifying the task of annotating mapped reads, which as you point out can get quite complicated (and messy).
  
  One approach a colleague took, was to create a table for mapped reads (or clusters of reads), where each column was a genomic property, e.g. antisense to promoter, sense to exon, etc., and each column would be filled with the transcript ID that fulfilled that property. At the end, he could be able to tell the properties of a read. Needless to say he didn’t merge exons, like I did, or else he would lose information.
  
  Cheers,
  
  Dave
  
  Reply
Jay says:

March 27, 2015 at 11:52

Can you share your intergenicRegion.bed file?

Reply
BenT says:

October 26, 2015 at 08:12

Thanks a lot, Dave.
It’s a wonderful post. While when I download gencode.v20.annotation.gtf.gz and so forth, it gives the error messages as below,
====
Checking and Creating Intergenic Regions
Warning: end of BED entry exceeds chromosome length. Please correct.
====
I also check the entry count for each output.
====
152320 gencode.v19.annotation.gtf.gz
8543 gencode_v19_exon_merged.bed.gz
575 gencode_v19_intergenic.bed.gz
6805 gencode_v19_intron.bed.gz
8060 promoter.bed.gz
8243 transcript_utr.bed.gz
3292 transcript_utr_number.out.gz
====
158000 gencode.v23.annotation.gtf.gz
8116 gencode_v23_exon_merged.bed.gz
187 gencode_v23_intergenic.bed.gz
6998 gencode_v23_intron.bed.gz
8731 promoter.bed.gz
8234 transcript_utr.bed.gz
3068 transcript_utr_number.out.gz
====

The gencode_v23_intergenic.bed.gz is much smaller than gencode_v19_intergenic.bed.gz, is it correct output?

Thanks a lot for your comment.

Reply
1. Davo says:
  
  October 26, 2015 at 10:15
  
  Hi Ben,
  
  GENCODE version 19 is the last version of GENCODE that uses the hg19 reference genome; that’s why you’re getting the warnings (a good thing!).
  
  In the MySQL command, change it to: “select chrom, size from hg38.chromInfo” and save the output as hg38.genome and change the workflow accordingly.
  
  Cheers,
  
  Dave
  
  Reply
Mayank Kaashyap says:

July 27, 2016 at 06:04

Hi Dave,
Very nice post. All goes well except when I try to isolate intergenic regions. I get an error: Less than the req’d two fields were encountered in the genome file. May I please request you to help me!

Reply
1. Davo says:
  
  July 28, 2016 at 06:38
  
  Can you confirm that your genome file looks something like this?
  
  cat hg19.genome | head chrom size chr1 249250621 chr2 243199373 chr3 198022430 chr4 191154276 chr5 180915260 chr6 171115067 chr7 159138663 chrX 155270560 chr8 146364022
  
  Reply
  1. Mayank Kaashyap says:
    
    July 28, 2016 at 10:59
    
    Hi Dave,
    Thanks so much. It works very well now!
    I was put Ca1 instead refseq Sequence NC_021160.1.
    Chromosome Size
    NC_021160.1 48359943
    NC_021161.1 36634854
    NC_021162.1 39989001
    NC_021163.1 49191682
    NC_021164.1 48169137
    NC_021165.1 59463898
    NC_021166.1 48961560
    NC_021167.1 16477302
    
    Reply
Mayank Kaashyap says:

May 16, 2017 at 10:34

Hi Dave,
Thansk for this. How can I cite you.

Reply
1. Davo says:
  
  May 17, 2017 at 01:44
  
  Hi Mayank,
  
  you can use the URL of the post if you really want to cite me. Otherwise, just cite:
  
  https://www.ncbi.nlm.nih.gov/pubmed/20110278
  
  Cheers,
  
  Dave
  
  Reply
  1. BioinfoBaby says:
    
    January 4, 2019 at 10:18
    
    Hi Dave,
    Really good post…..I have created intron.bed file but for intergenic region I am getting error:
    Error: The genome file gen.size has no valid entries. Exiting.
    I have created this file from (ftp://ftp.solgenomics.net/tomato_genome/assembly/current_build/stats_chromosomes.pdf) pdf and single space delimiter used.
    chrom size
    chr0 21805821
    chr1 90304244
    chr2 49918294
    chr3 64840714
    chr4 64064312
    chr5 65021438
    chr6 46041636
    chr7 65268621
    chr8 63032657
    chr9 67662091
    chr10 64834305
    chr11 53386025
    chr12 65486253
    even I tried tab as delimiter but it gave this error:
    ***** ERROR: illegal character ‘ ‘ found in integer conversion of string ” 21805821″. Exiting…
    chrom size
    chr0 21805821
    chr1 90304244
    chr2 49918294
    chr3 64840714
    chr4 64064312
    chr5 65021438
    chr6 46041636
    chr7 65268621
    chr8 63032657
    chr9 67662091
    chr10 64834305
    chr11 53386025
    chr12 65486253
    
    Reply
    1. Davo says:
      
      January 23, 2019 at 08:29
      
      Copy text from PDFs can sometimes introduce unsupported characters, which may be what the “illegal character” warning is about. Try to copy the coordinates from a plain text file instead.
      
      Reply
cathal says:

May 14, 2019 at 08:00

Hi Dave,

Thanks for your great post.

Just to check – does the intergenic regions created here include or exclude introns?

Cheers

Reply
king says:

May 14, 2019 at 09:28

Hi,

When I try
‘wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_$v/gencode.v$v.annotation.gtf.gz‘
I get the message: ‘No such directory `pub/gencode/Gencode_human/release_’.’

Is this because of the new hg version? How would I adapt this pipeline for hg38?

Thank you.

Reply
Siyu Sun says:

December 15, 2021 at 16:19

Hi Dave,

Can you comment on how to select the “padding” for promoter assignment? What would be a reasonable padding for hg38? Are there any good criteria to choose it?

Thanks,
Siyu

Reply
1. Davo says:
  
  December 17, 2021 at 03:42
  
  Hi Siyu,
  
  for mammalian promoters, I use -300/+100, which is based on observations from CAGE data (see https://pubmed.ncbi.nlm.nih.gov/16645617/). If you are unfamiliar with CAGE, I have an old set of slides at http://davetang.github.io/cage_r/#1. I’m not so sure with non-mammalian promoters.
  
  Hope that helps,
  Dave
  
  Reply
Siyu Sun says:

December 17, 2021 at 15:31

Hi Dave,

Thanks for the prompt reply and additional valuable resources! This is super helpful~

Happy holidays~

Best,
Siyu

Reply

Defining genomic regions

Genome coverage

The untranslated region (UTR)

Defining the 5' and 3' UTRs

Promoter region

See also

Leave a Reply Cancel reply