Updated 2014 June 24th to use GENCODE version 19
RNA sequencing (RNA-Seq) reads are typically mapped back to the genome (or transcriptome in some cases) after sequencing. The next task is to annotate the reads, to see which regions the reads mapped to. Typically one creates an annotation file and compares the coordinates of the mapped reads to the annotation file. Creating this annotation file is quite easy using BEDTools; in this post I refer to the creation of the annotation file as defining genomic regions, since in the end I will have several files that contain coordinates of exonic, intronic, and intergenic regions. I will define these regions with respect to GENCODE annotations.
First let's install the latest version of BEDTools:
git clone https://github.com/arq5x/bedtools2.git cd bedtools2 make clean; make all bin/bedtools --version #bedtools v2.20.1-4-gb877b35 cd ..
Now download the GENCODE version 19 (which is currently the latest update):
v=19 wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_$v/gencode.v$v.annotation.gtf.gz #what does this file look like? zcat gencode.v19.annotation.gtf.gz | head ##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74) ##provider: GENCODE ##contact: gencode@sanger.ac.uk ##format: gtf ##date: 2013-12-05 chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2"; chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
In the third column of the GTF file corresponds to the feature type. How many different types of features are there?
zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f3 | sort | uniq -c | sort -k1rn 1196293 exon 723784 CDS 284573 UTR 196520 transcript 84144 start_codon 76196 stop_codon 57820 gene 114 Selenocysteine
So the exonic regions are already defined; however, I would like to remove overlapping exons. The mergeBed utility in BEDTools can accomplish this; however before merging make sure the coordinates are sorted. We can use sortBed to sort the BED file:
v=19 zcat gencode.v$v.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' | bedtools2/bin/sortBed | bedtools2/bin/mergeBed -i - | gzip > gencode_v${v}_exon_merged.bed.gz
To define intronic regions, we just need to subtract the exonic region from the genic region. The utility subtractBed can do this:
v=19 zcat gencode.v$v.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' | bedtools2/bin/sortBed | bedtools2/bin/subtractBed -a stdin -b gencode_v${v}_exon_merged.bed.gz | gzip > gencode_v${v}_intron.bed.gz #let's intersect the two files #this shouldn't produce any output intersectBed -a gencode_v${v}_exon_merged.bed.gz -b gencode_v${v}_intron.bed.gz
And finally to define intergenic regions, we use complementBed to find regions not covered by genes.
v=19 mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \ "select chrom, size from hg19.chromInfo" > hg19.genome zcat gencode.v$v.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' | bedtools2/bin/sortBed | bedtools2/bin/complementBed -i stdin -g hg19.genome | gzip > gencode_v${v}_intergenic.bed.gz
And that's it.
Here's a schematic showing the steps we performed above.
Genome coverage
How much of the genome does exons, introns, and intergenic regions cover?
#!/bin/env perl use strict; use warnings; my $v = 19; my $exon_file = "gencode_v${v}_exon_merged.bed.gz"; my $intergenic_file = "gencode_v${v}_intergenic.bed.gz"; my $intron_file = "gencode_v${v}_intron.bed.gz"; my $exon_coverage = coverage($exon_file); my $intergenic_coverage = coverage($intergenic_file); my $intron_coverage = coverage($intron_file); my $total = $exon_coverage + $intergenic_coverage + $intron_coverage; printf "Exon: %.2f\n", $exon_coverage*100/$total; printf "Intron: %.2f\n", $intron_coverage*100/$total; printf "Intergenic: %.2f\n", $intergenic_coverage*100/$total; sub coverage { my ($infile) = @_; my $coverage = 0; open(IN,'-|',"zcat $infile") || die "Could not open $infile: $!\n"; while(<IN>){ chomp; my ($chr, $start, $end) = split(/\t/); my $c = $end - $start; $coverage += $c; } close(IN); return($coverage); } exit(0);
Running this script:
coverage.pl Exon: 3.72 Intron: 48.25 Intergenic: 48.03
I had initially thought that the intergenic region would have the highest percent coverage in the genome compared to introns and exons.
The untranslated region (UTR)
I got a question (see comments) regarding the untranslated region (UTR). To illustrate my answer, I'll use the transcript ENST00000335295 as an example:
v=19 zcat gencode.v$v.annotation.gtf.gz | grep ENST00000335295 chr11 HAVANA transcript 5246694 5248301 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2"; chr11 HAVANA exon 5248160 5248301 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1; exon_id "ENSE00001829867.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2"; chr11 HAVANA CDS 5248160 5248251 . - 0 gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1; exon_id "ENSE00001829867.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2"; chr11 HAVANA start_codon 5248249 5248251 . - 0 gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1; exon_id "ENSE00001829867.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2"; chr11 HAVANA exon 5247807 5248029 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 2; exon_id "ENSE00001057381.1"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2"; chr11 HAVANA CDS 5247807 5248029 . - 1 gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 2; exon_id "ENSE00001057381.1"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2"; chr11 HAVANA exon 5246694 5246956 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3; exon_id "ENSE00001600613.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2"; chr11 HAVANA CDS 5246831 5246956 . - 0 gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3; exon_id "ENSE00001600613.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2"; chr11 HAVANA stop_codon 5246828 5246830 . - 0 gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3; exon_id "ENSE00001600613.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2"; chr11 HAVANA UTR 5248252 5248301 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2"; chr11 HAVANA UTR 5246694 5246830 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
This transcript (HBB) starts at 5248301 and ends at 5246694 on chromosome 11 (the coordinates are reversed because it's on the negative strand). We see two lines with the UTR feature for this transcript, one starting at 5248301 and the other at 5246830.
So the coordinates for the 5' UTR are chr11:5248252-5248301 and the 3' UTR are chr11:5246694-5246830. Perhaps I should have picked a transcript that's on the positive strand, however this was one example I knew where the transcript is short and well defined (contains the start and end codons, and the 5' and 3' UTRs). Which now leads me to wonder, how many transcripts have defined UTRs?
#!/bin/env perl use strict; use warnings; my $usage = "Usage: $0 <infile.annotation.gtf.gz>\n"; my $infile = shift or die $usage; if ($infile =~ /\.gz/){ open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n"; } else { open(IN,'<',$infile) || die "Could not open $infile: $!\n"; } my %transcript = (); my $current_transcript = ''; while(<IN>){ chomp; next if (/^#/); #chr11 HAVANA transcript 65265233 65273940 . + . gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1"; my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/); my @annotation = split(/;\s/,$annotation); my $transcript_id = 'none'; if ($type eq 'transcript'){ foreach my $blah (@annotation){ my ($type,$name) = split(/\s+/,$blah); if ($type eq 'transcript_id'){ $current_transcript = $name; $current_transcript =~ s/"//g; $transcript{$current_transcript} = 0; } } if ($current_transcript eq 'none'){ die "No name for entry $.\n"; } } if ($type eq 'UTR'){ $transcript{$current_transcript}++; } } close(IN); foreach my $transcript (keys %transcript){ print "$transcript\t$transcript{$transcript}\n"; } exit(0);
Now to run the script:
check_utr.pl gencode.v19.annotation.gtf.gz | gzip > transcript_utr_number.out.gz zcat transcript_utr_number.out | wc -l 196520 zcat transcript_utr_number.out | cut -f2 | sort | uniq -c | sort -k1rn | head 103800 0 34999 2 24234 3 11543 1 10352 4 4493 5 2264 6 1297 7 796 8 583 9 zcat transcript_utr_number.out | awk '$2==57 {print}' ENST00000264319.7 57
Running the script above, I find that over half of the GENCODE transcripts (103,800/196,520) don't have a defined UTRs. 34,999 transcripts have 2 defined UTRs, 24,234 have 3 because one of the UTR is spliced. The transcript ENST00000264319.7 had 57 UTR lines, which was the most UTR annotations for a transcript.
Defining the 5' and 3' UTRs
I'm not entirely sure why the UTRs weren't separated into 5' and 3', especially when GENCODE went through the trouble of labelling the start_codon and the stop_codon. I'm going to assume that if a UTR is closer to the starting position of the transcript it's the 5' UTR and vice versa.
#!/bin/env perl use strict; use warnings; my $usage = "Usage: $0 <infile.annotation.gtf>\n"; my $infile = shift or die $usage; if ($infile =~ /\.gz/){ open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n"; } else { open(IN,'<',$infile) || die "Could not open $infile: $!\n"; } my %transcript = (); my $current_transcript = ''; my $transcript_start = 0; my $transcript_end = 0; while(<IN>){ chomp; next if (/^#/); #chr11 HAVANA transcript 65265233 65273940 . + . gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1"; my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/); my @annotation = split(/;\s/,$annotation); my $transcript_id = 'none'; if ($type eq 'transcript'){ foreach my $blah (@annotation){ my ($type,$name) = split(/\s+/,$blah); if ($type eq 'transcript_id'){ $current_transcript = $name; $current_transcript =~ s/"//g; $transcript_start = $start; $transcript_end = $end; } } if ($current_transcript eq 'none'){ die "No name for entry $.\n"; } } if ($type eq 'UTR'){ my $region = ''; if ($strand eq '+'){ my $dis_to_start = abs($start - $transcript_start); my $dis_to_end = abs($start - $transcript_end); $region = $dis_to_start < $dis_to_end ? '5_UTR' : '3_UTR'; } else { my $dis_to_start = abs($end - $transcript_end); my $dis_to_end = abs($end - $transcript_start); $region = $dis_to_start < $dis_to_end ? '5_UTR' : '3_UTR'; } print join ("\t", $chr, $start, $end, $region, $current_transcript, $strand),"\n"; } } close(IN); exit(0);
To run the script:
print_utr.pl gencode.v19.annotation.gtf.gz | gzip > transcript_utr.bed.gz zcat transcript_utr.bed.gz | head chr1 70006 70008 3_UTR ENST00000335137.3 + chr1 137621 138532 5_UTR ENST00000423372.3 - chr1 139310 139379 5_UTR ENST00000423372.3 - chr1 134901 135802 3_UTR ENST00000423372.3 - chr1 367640 367658 5_UTR ENST00000426406.1 + chr1 368595 368634 3_UTR ENST00000426406.1 + chr1 621059 621098 3_UTR ENST00000332831.2 - chr1 622035 622053 5_UTR ENST00000332831.2 - chr1 819981 819983 3_UTR ENST00000594233.1 + chr1 860260 860328 5_UTR ENST00000420190.1 +
Basically the script gets the coordinates of the transcript, and whenever there is an UTR annotation, it checks to see if the UTR coordinates are closer to the start or the end of the transcript.
Promoter region
Here's a script that reads the GENCODE annotation file, obtains the starting positions of each transcript, adds some user defined padding around this starting position and outputs it as a bed file. The starting position plus some padding is defined as the promoter region. Note that this script assumes that the hg19.genome file is available in the same directory as the script.
#!/bin/env perl use strict; use warnings; my $usage = "Usage: $0 <infile.annotation.gtf> <padding>\n"; my $infile = shift or die $usage; my $span = shift or die $usage; if ($span !~ /^\d+$/){ die "Please enter a numeric value for the padding\n"; } my $hg19 = 'hg19.genome'; my %hg19 = (); open(IN,'<',$hg19) || die "Could not open $hg19: $!\n"; while(<IN>){ chomp; #chr9_gl000201_random 36148 my ($chr, $end) = split(/\t/); $hg19{$chr} = $end; } close(IN); if ($infile =~ /\.gz/){ open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n"; } else { open(IN,'<',$infile) || die "Could not open $infile: $!\n"; } while(<IN>){ chomp; next if (/^#/); #chr11 HAVANA transcript 65265233 65273940 . + . gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1"; my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/); next unless $type eq 'transcript'; my @annotation = split(/;\s/,$annotation); my $transcript_id = 'none'; foreach my $blah (@annotation){ my ($type,$name) = split(/\s+/,$blah); if ($type eq 'transcript_id'){ $transcript_id = $name; $transcript_id =~ s/"//g; } } if ($transcript_id eq 'none'){ die "No name for entry $.\n"; } my $promoter_start = ''; my $promoter_end = ''; if ($strand eq '+'){ $promoter_start = $start - $span; $promoter_end = $start + $span; } else { $promoter_start = $end - $span; $promoter_end = $end + $span; } if ($promoter_start < 0){ warn "Adjusted promoter start to 0\n"; $promoter_start = 0; } elsif ($promoter_end > $hg19{$chr}){ warn "Adjusted promoter end to $hg19{$chr}\n"; $promoter_end = $hg19{$chr}; } print join("\t",$chr,$promoter_start,$promoter_end,$transcript_id,0,$strand),"\n"; } close(IN); exit(0);
Running the script:
promoter.pl gencode.v19.annotation.gtf.gz 200 | head chr1 11669 12069 ENST00000456328.2 0 + chr1 11672 12072 ENST00000515242.2 0 + chr1 11674 12074 ENST00000518655.2 0 + chr1 11810 12210 ENST00000450305.2 0 + chr1 29170 29570 ENST00000438504.2 0 - chr1 24686 25086 ENST00000541675.1 0 - chr1 29170 29570 ENST00000423562.1 0 - chr1 29370 29770 ENST00000488147.1 0 - chr1 29606 30006 ENST00000538476.1 0 - chr1 29354 29754 ENST00000473358.1 0 +
See also
See the BEDTools documentation for cool usage examples.
All code used in this post is available at https://github.com/davetang/defining_genomic_regions.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi Dave,
Great resources and tutorials to extract the genomic regions from the GENCODE file.
Here, I want to look into the 3’UTR and 5’UTR regions in the gencode gtf file
altough the gtf file has a thrid column containg UTR info, it does not seperate the 3’UTR and 5’UTR.
Moreover, in UCSC browser, it offers us the 3’UTR&5’UTR bed file for V14 gencode.
So my question is how can we find this info from the gencode file from http://www.gencodegenes.org/releases/17.html, i.e., V17, the latest version gtf file?
Thanks
Hi Asaki,
Thank you for the compliment. As for the UTRs, since we know the orientation of the gene/transcript, from the gtf file, we know whether the UTR is at the 5′ or 3′ end.
Bioconductor contains GENCODE genes, but it is not the latest version (since it’s derived from the UCSC Genome Browser).
It wouldn’t be too difficult to write a Perl script that extracts UTRs based on the gene orientation but some people are against writing custom scripts. I’m fine with custom scripts as long as they work as expected and are shared.
Cheers,
Dave
How about doing the same thing on a plant genome or for other non model organism for which something like gencode is not available.
Yeah just use another set of transcript annotations and adapt the same concept.
What if the organism does not have annotations of UTR and also how to get it in a format like gencode has. For example I have this file :-
ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Gmax/annotation/Gmax_189_gene.gff3.gz
what additional steps do I need to follow to get it into a format similar to gencode.
Also gencode has lot of annotations that my file does not so your script does not seems to work with it.
Hi Saad,
I downloaded the file and had a look; it has annotations for:
CDS
five_prime_UTR
gene
mRNA
three_prime_UTR
It seems the UTRs are defined already?
Cheers,
Dave
Hi Dave,
I was more interested to generate a file similar to what UCSC generates. Unfortunately my organism is not there in UCSC browser.
This is the kind of output I am looking for. Do you think if there are already tools to do that or do I have to write a script for it myself.
output by ucsc : –
chr1 14969 15038 NR_024540_utr3_1_0_chr1_14970_r 0 – NR_024540 WASH7P 653635 pseudogene
Thank you very much for the above executed scripts, i was able to cross check my ouput with that of bedtools. I have no transcript info. All i have is exon and cds.The difference lowest start point of cds and lowest start point of exon will be my 5’UTR and difference in highest will be my 3’UTR.
The following is my GFF file can you please help me extract the UTR regions?
Ca1 Gnomon gene 254 741 . – . ID=gene0;Name=LOC101497325;Dbxref=GeneID:101497325;gbkey=Gene;gene=LOC101497325
Ca1 Gnomon mRNA 254 741 . – . ID=rna0;Name=XM_004486173.1;Parent=gene0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon exon 602 741 . – . ID=id1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon exon 254 506 . – . ID=id2;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon CDS 602 663 . – 0 ID=cds0;Name=XP_004486230.1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XP_004486230.1;gbkey=CDS;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;protein_id=XP_004486230.1
Ca1 Gnomon CDS 254 506 . – 1 ID=cds0;Name=XP_004486230.1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XP_004486230.1;gbkey=CDS;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;protein_id=XP_004486230.1
Ca1 Gnomon gene 18089 20552 . – . ID=gene1;Name=LOC101488339;Dbxref=GeneID:101488339;gbkey=Gene;gene=LOC101488339
Ca1 Gnomon gene 165984 168949 . + . ID=gene10;Name=LOC101493009;Dbxref=GeneID:101493009;gbkey=Gene;gene=LOC101493009
Ca1 Gnomon mRNA 165984 168949 . + . ID=rna18;Name=XM_004485362.1;Parent=gene10;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon exon 165984 166302 . + . ID=id169;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon exon 167289 168949 . + . ID=id170;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon CDS 167476 168681 . + 0 ID=cds17;Name=XP_004485419.1;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XP_004485419.1;gbkey=CDS;product=uncharacterized protein LOC101493009;protein_id=XP_004485419.1
Hi Alok,
if I understood you correctly, then the 5′ UTR for LOC101497325 is Ca1:663-741 and the 3′ UTR is Ca1:254-506 and for LOC101493009 the 5′ UTR is Ca1:165984-166302 and the 3′ UTR is Ca1:168681-168949.
Cheers,
Dave
Thanks Dave for this very nice post.
I found a small in the conversion of GTF format to BED format. GTF has 1-based start coordinates, and BED has 0-based start coordinates. Thus the awk command to convert should for example be:
zcat gencode.v$v.annotation.gtf.gz |
awk ‘BEGIN{OFS=”\t”;} $3==”exon” {print $1,$4-1,$5}’
etc.
Thanks
Julien
Thanks Julien! This off-by-one error caused by 0- and 1-based coordinates is probably the most common bioinformatic error 🙂
Hi Dave,
nice piece of work, thanks. Wondering about your thoughts on how to avoid issues when merging the exons due to overlapping genes on each strand. How do you then know which gene/transcript is aligned to?
Thanks,
Bruce.
Hi Bruce,
thanks for the comment. This post was definitely simplifying the task of annotating mapped reads, which as you point out can get quite complicated (and messy).
One approach a colleague took, was to create a table for mapped reads (or clusters of reads), where each column was a genomic property, e.g. antisense to promoter, sense to exon, etc., and each column would be filled with the transcript ID that fulfilled that property. At the end, he could be able to tell the properties of a read. Needless to say he didn’t merge exons, like I did, or else he would lose information.
Cheers,
Dave
Can you share your intergenicRegion.bed file?
Thanks a lot, Dave.
It’s a wonderful post. While when I download gencode.v20.annotation.gtf.gz and so forth, it gives the error messages as below,
====
Checking and Creating Intergenic Regions
Warning: end of BED entry exceeds chromosome length. Please correct.
====
I also check the entry count for each output.
====
152320 gencode.v19.annotation.gtf.gz
8543 gencode_v19_exon_merged.bed.gz
575 gencode_v19_intergenic.bed.gz
6805 gencode_v19_intron.bed.gz
8060 promoter.bed.gz
8243 transcript_utr.bed.gz
3292 transcript_utr_number.out.gz
====
158000 gencode.v23.annotation.gtf.gz
8116 gencode_v23_exon_merged.bed.gz
187 gencode_v23_intergenic.bed.gz
6998 gencode_v23_intron.bed.gz
8731 promoter.bed.gz
8234 transcript_utr.bed.gz
3068 transcript_utr_number.out.gz
====
The gencode_v23_intergenic.bed.gz is much smaller than gencode_v19_intergenic.bed.gz, is it correct output?
Thanks a lot for your comment.
Hi Ben,
GENCODE version 19 is the last version of GENCODE that uses the hg19 reference genome; that’s why you’re getting the warnings (a good thing!).
In the MySQL command, change it to: “select chrom, size from hg38.chromInfo” and save the output as hg38.genome and change the workflow accordingly.
Cheers,
Dave
Hi Dave,
Very nice post. All goes well except when I try to isolate intergenic regions. I get an error: Less than the req’d two fields were encountered in the genome file. May I please request you to help me!
Can you confirm that your genome file looks something like this?
cat hg19.genome | head
chrom size
chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
chr6 171115067
chr7 159138663
chrX 155270560
chr8 146364022
Hi Dave,
Thanks so much. It works very well now!
I was put Ca1 instead refseq Sequence NC_021160.1.
Chromosome Size
NC_021160.1 48359943
NC_021161.1 36634854
NC_021162.1 39989001
NC_021163.1 49191682
NC_021164.1 48169137
NC_021165.1 59463898
NC_021166.1 48961560
NC_021167.1 16477302
Hi Dave,
Thansk for this. How can I cite you.
Hi Mayank,
you can use the URL of the post if you really want to cite me. Otherwise, just cite:
https://www.ncbi.nlm.nih.gov/pubmed/20110278
Cheers,
Dave
Hi Dave,
Really good post…..I have created intron.bed file but for intergenic region I am getting error:
Error: The genome file gen.size has no valid entries. Exiting.
I have created this file from (ftp://ftp.solgenomics.net/tomato_genome/assembly/current_build/stats_chromosomes.pdf) pdf and single space delimiter used.
chrom size
chr0 21805821
chr1 90304244
chr2 49918294
chr3 64840714
chr4 64064312
chr5 65021438
chr6 46041636
chr7 65268621
chr8 63032657
chr9 67662091
chr10 64834305
chr11 53386025
chr12 65486253
even I tried tab as delimiter but it gave this error:
***** ERROR: illegal character ‘ ‘ found in integer conversion of string ” 21805821″. Exiting…
chrom size
chr0 21805821
chr1 90304244
chr2 49918294
chr3 64840714
chr4 64064312
chr5 65021438
chr6 46041636
chr7 65268621
chr8 63032657
chr9 67662091
chr10 64834305
chr11 53386025
chr12 65486253
Copy text from PDFs can sometimes introduce unsupported characters, which may be what the “illegal character” warning is about. Try to copy the coordinates from a plain text file instead.
Hi Dave,
Thanks for your great post.
Just to check – does the intergenic regions created here include or exclude introns?
Cheers
Hi,
When I try
‘wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_$v/gencode.v$v.annotation.gtf.gz‘
I get the message: ‘No such directory `pub/gencode/Gencode_human/release_’.’
Is this because of the new hg version? How would I adapt this pipeline for hg38?
Thank you.
Hi Dave,
Can you comment on how to select the “padding” for promoter assignment? What would be a reasonable padding for hg38? Are there any good criteria to choose it?
Thanks,
Siyu
Hi Siyu,
for mammalian promoters, I use -300/+100, which is based on observations from CAGE data (see https://pubmed.ncbi.nlm.nih.gov/16645617/). If you are unfamiliar with CAGE, I have an old set of slides at http://davetang.github.io/cage_r/#1. I’m not so sure with non-mammalian promoters.
Hope that helps,
Dave
Hi Dave,
Thanks for the prompt reply and additional valuable resources! This is super helpful~
Happy holidays~
Best,
Siyu