I have started to work on Arabidopsis thaliana, as I mentioned in my last post. As noted in the Encyclopedia of life:
Arabidopsis thaliana is the most widely used model organism in plant biology. Its small genome size, fully sequenced in the year 2000, chromosome number, fast growth cycle (from seed germination to set in four to six weeks), small size (hundreds can be grown in a pot and thousands in a growth chamber), autogamous breeding system (induced mutations are expressed in two generations), and ability to grow on various synthetic media, all make the species an ideal system in experimental biology.
As indispensable tool for genomics is a genome browser and I have been a long time fan of the UCSC Genome Browser. To my initial dismay, the Arabidopsis genome was not available on the Genome Browser Gateway page. But later I found an Assembly Hub for plants, which included the TAIR10 genome assembly. Simply head over to the Track Data Hubs page, find "CSHL Biology of Genomes meeting 2013 demonstration assembly hub", and click connect.
Not as many available tracks as compared to the human genome browser page.
In addition, you can download the genome sequence and other useful files from the assembly hub.
# download the genome wget http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/araTha1/araTha1.2bit # download the chrom sizes, which is useful for BEDTools wget http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/araTha1/araTha1.chrom.sizes # convert to fasta # download executables from http://genome-test.cse.ucsc.edu/~kent/exe/ twoBitToFa araTha1.2bit araTha1.fa head araTha1.fa >chr1 ccctaaaccctaaaccctaaaccctaaacctctgaatccttaatccctaa atccctaaatctttaaatcctacatccatgaatccctaaatacctaattc cctaaacCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATG ATAATTTTATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTT AAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTT CCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAA GCTTTGCTACGATCTACATTTGGGAATGTGAGTCTCTTATTGTAACCTTA GGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTT GGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAA # random FASTA file head random.fa >chr3:14195741-14196603 TCAAATTCCAAGTATTTCTTTTTTTTTGGCACCGGTGTCTCCTCAGACAT TTCAATGTCTGTTGGTGCCAAGAGGGAAAAGGGCTATTAAGCTATATAGG GGGGTGGGTGTTGAGGGAGTCTGGGCAGTCCGTGGGAACCCCCTTTATCG GTTCGGACTTGGGTAGCGATCGAGGGATGGTATCGGATATCGACACGAGG AATGACCGACCGTCCGGCCGCCGGGATTTTCGCCGGAAAACTTTTCCGGC GACTTTTCCGGCGATCGGTTTTGTTGCCTTTTTCCGAGTTTTCTCAGCAG TTCTCGGACAAAAACTGCTGAATCGTCGAGGAGAATGGGCTTGCCTTGCG TGGGCTGCCATTAGTTCTTCGAGGCGTTAGGGTGGCGGCGGTATAAAAGT GTCGGAGTTTTTTCAGCAGTTCTCGGACAAAAATTGCTGAGTGGCCGAGA # blat random FASTA file # see https://davetang.org/muse/2012/05/15/using-blat/ blat -minScore=0 -stepSize=1 araTha1.2bit random.fa random.psl Loaded 119667750 letters in 7 sequences Searched 863 bases in 1 sequences # clone my PSL to BED script git clone https://gist.github.com/7314846.git perl 7314846/psl_to_bed_best_score.pl random.psl random.bed cat random.bed chr3 14195740 14196603 chr3:14195741-14196603 863 + 14195740 14196603 255,0,0 1 863, 0
Bioconductor
Bioconductor provides useful annotation packages for various organisms, including Arabidopsis thaliana. The annotation packages are quite easy to use; the select(), columns(), and keys() functions are used together to extract data from an AnnotationDb object. The org.At.tair.db package provides genome wide annotation for Arabidopsis.
# install if necessary
source("https://bioconductor.org/biocLite.R")
biocLite("org.At.tair.db")
# load library
library(org.At.tair.db)
# The name of an org package is always of the form org.Ab.id.db
# where Ab is a 2-letter abbreviation of the organism and
# id is an abbreviation (in lower-case) describing the type of central identifier
class(org.At.tair.db)
[1] "OrgDb"
attr(,"package")
[1] "AnnotationDbi"
# columns shows which kinds of data can be returned for the AnnotationDb object
columns(org.At.tair.db)
[1] "ARACYC" "ARACYCENZYME" "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[8] "GO" "GOALL" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PMID" "REFSEQ"
[15] "SYMBOL" "TAIR"
# map ENTREZIDs to gene symbols and TAIR IDs
# store six Entrez IDs
my_key <- head(keys(org.At.tair.db, keytype="ENTREZID"))
head(my_key)
[1] "839580" "839569" "839321" "839574" "839579" "839341"
# columns to return
my_col <- c('SYMBOL', 'TAIR')
# query!
select(org.At.tair.db, keys=my_key, columns=my_col, keytype="ENTREZID")
'select()' returned 1:many mapping between keys and columns
ENTREZID SYMBOL TAIR
1 839580 ANAC001 AT1G01010
2 839580 NAC001 AT1G01010
3 839569 ARV1 AT1G01020
4 839321 NGA3 AT1G01030
5 839574 ASU1 AT1G01040
6 839574 ATDCL1 AT1G01040
7 839574 CAF AT1G01040
8 839574 DCL1 AT1G01040
9 839574 EMB60 AT1G01040
10 839574 EMB76 AT1G01040
11 839574 SIN1 AT1G01040
12 839574 SUS1 AT1G01040
13 839579 AtPPa1 AT1G01050
14 839579 PPa1 AT1G01050
15 839341 LHY AT1G01060
16 839341 LHY1 AT1G01060
# number of TAIR IDs
length(keys(org.At.tair.db, keytype = 'TAIR'))
[1] 27416
For example, the annotation package makes it easy to find potential transcription factors; I just need to find genes annotated with the gene ontology (GO) accession "GO:0003700", which is for transcription factor activity, sequence-specific DNA binding.
my_go <- 'GO:0003700'
my_go_tf <- select(org.At.tair.db, keys=my_go, columns=my_col, keytype="GO")
'select()' returned 1:many mapping between keys and columns
head(my_go_tf)
GO EVIDENCE ONTOLOGY SYMBOL TAIR
1 GO:0003700 ISS MF ANAC001 AT1G01010
2 GO:0003700 ISS MF NAC001 AT1G01010
3 GO:0003700 ISS MF NGA3 AT1G01030
4 GO:0003700 IDA MF LHY AT1G01060
5 GO:0003700 IDA MF LHY1 AT1G01060
6 GO:0003700 ISS MF LHY AT1G01060
# number of potential transcription factors
length(unique(my_go_tf$TAIR))
[1] 1669
I can use the GO annotation package to find gene ontologies associated with a gene of interest.
my_gene <- 'AT5G60130'
my_gene_go <- select(org.At.tair.db, keys=my_gene, columns='GO', keytype="TAIR")
'select()' returned 1:many mapping between keys and columns
my_gene_go
TAIR GO EVIDENCE ONTOLOGY
1 AT5G60130 GO:0003677 ISS MF
2 AT5G60130 GO:0003700 ISS MF
3 AT5G60130 GO:0005634 ISM CC
4 AT5G60130 GO:0006355 ISS BP
# install if necessary
source("https://bioconductor.org/biocLite.R")
biocLite("GO.db")
library(GO.db)
Term(my_gene_go$GO)
GO:0003677
"DNA binding"
GO:0003700
"transcription factor activity, sequence-specific DNA binding"
GO:0005634
"nucleus"
GO:0006355
"regulation of transcription, DNA-templated"
The TxDb.Athaliana.BioMart.plantsmart28 annotation package provides coordinates for Arabidopsis thaliana genes and transcripts.
# install if necessary
source("https://bioconductor.org/biocLite.R")
biocLite("TxDb.Athaliana.BioMart.plantsmart28")
library(TxDb.Athaliana.BioMart.plantsmart28)
columns(TxDb.Athaliana.BioMart.plantsmart28)
[1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSSTART" "CDSSTRAND" "EXONCHROM" "EXONEND"
[9] "EXONID" "EXONNAME" "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID" "TXCHROM" "TXEND"
[17] "TXID" "TXNAME" "TXSTART" "TXSTRAND" "TXTYPE"
all_gene <- genes(TxDb.Athaliana.BioMart.plantsmart28)
all_gene
GRanges object with 33602 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
AT1G01010 1 [ 3631, 5899] + | AT1G01010
AT1G01020 1 [ 5928, 8737] - | AT1G01020
AT1G01030 1 [11649, 13714] - | AT1G01030
AT1G01040 1 [23146, 31227] + | AT1G01040
AT1G01046 1 [28500, 28706] + | AT1G01046
... ... ... ... . ...
ATMG01370 Mt [360717, 361052] - | ATMG01370
ATMG01380 Mt [361062, 361179] - | ATMG01380
ATMG01390 Mt [361350, 363284] - | ATMG01390
ATMG01400 Mt [363725, 364042] + | ATMG01400
ATMG01410 Mt [366086, 366700] - | ATMG01410
-------
seqinfo: 7 sequences (1 circular) from an unspecified genome
# See GRanges vignette for a nice introduction
# http://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.pdf
seqnames(all_gene)
factor-Rle of length 33602 with 7 runs
Lengths: 8433 5513 6730 5140 7507 133 146
Values : 1 2 3 4 5 Pt Mt
Levels(7): 1 2 3 4 5 Mt Pt
# find location of gene
my_gene <- 'AT5G60130'
all_gene[names(all_gene)==my_gene]
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
AT5G60130 5 [24211871, 24213164] - | AT5G60130
-------
seqinfo: 7 sequences (1 circular) from an unspecified genome
# find transcripts for gene
my_col <- columns(TxDb.Athaliana.BioMart.plantsmart28)
select(TxDb.Athaliana.BioMart.plantsmart28,
keys = my_gene,
columns = my_col,
keytype = 'GENEID')
'select()' returned 1:many mapping between keys and columns
GENEID CDSID CDSNAME CDSCHROM CDSSTRAND CDSSTART CDSEND EXONID EXONNAME EXONCHROM EXONSTRAND
1 AT5G60130 145033 <NA> 5 - 24212956 24212958 168214 AT5G60130.1.exon1 5 -
2 AT5G60130 145031 <NA> 5 - 24212216 24212854 168212 AT5G60130.1.exon2 5 -
3 AT5G60130 145029 <NA> 5 - 24211871 24212131 168210 AT5G60130.2.exon5 5 -
4 AT5G60130 145035 <NA> 5 - 24213103 24213112 168216 AT5G60130.2.exon1 5 -
5 AT5G60130 145034 <NA> 5 - 24212956 24213026 168215 AT5G60130.2.exon2 5 -
6 AT5G60130 145032 <NA> 5 - 24212390 24212854 168213 AT5G60130.2.exon3 5 -
7 AT5G60130 145030 <NA> 5 - 24212216 24212311 168211 AT5G60130.2.exon4 5 -
8 AT5G60130 145029 <NA> 5 - 24211871 24212131 168210 AT5G60130.2.exon5 5 -
EXONSTART EXONEND TXID EXONRANK TXNAME TXTYPE TXCHROM TXSTRAND TXSTART TXEND
1 24212956 24212958 40825 1 AT5G60130.1 protein_coding 5 - 24211871 24212958
2 24212216 24212854 40825 2 AT5G60130.1 protein_coding 5 - 24211871 24212958
3 24211871 24212131 40825 3 AT5G60130.1 protein_coding 5 - 24211871 24212958
4 24213103 24213164 40826 1 AT5G60130.2 protein_coding 5 - 24211871 24213164
5 24212956 24213026 40826 2 AT5G60130.2 protein_coding 5 - 24211871 24213164
6 24212390 24212854 40826 3 AT5G60130.2 protein_coding 5 - 24211871 24213164
7 24212216 24212311 40826 4 AT5G60130.2 protein_coding 5 - 24211871 24213164
8 24211871 24212131 40826 5 AT5G60130.2 protein_coding 5 - 24211871 24213164
Promoter sequence
I will use the Arabidopsis thaliana genome sequence provided on the UCSC Genome Browser test site (see above); note that araTha1 corresponds to TAIR10. We will use SAMtools to extract the promoter sequence from the genome; first we need to index the FASTA file.
samtools faidx araTha1.fa # note the 1 base start head -2 araTha1.fa >chr1 ccctaaaccctaaaccctaaaccctaaacctctgaatccttaatccctaa samtools faidx araTha1.fa chr1:0-11 >chr1:0-11 ccctaaaccct samtools faidx araTha1.fa chr1:1-11 >chr1:1-11 ccctaaaccct
Now we just need to provide the coordinates of promoter sequences to SAMtools. As far as I'm aware, there's no clear boundary as to where a promoter starts and ends; I've seen people use -300/+100 bp or -1000/+1000. For this post, I'll use -500/+500 and you can modify it as you please. However, we need the coordinates of the genes. I will use the GTF file provided by Ensembl Plants.
wget -c ftp://ftp.ensemblgenomes.org/pub/release-37/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.37.gtf.gz
gunzip -c Arabidopsis_thaliana.TAIR10.37.gtf.gz | head
#!genome-build TAIR10
#!genome-version TAIR10
#!genome-date 2010-09
#!genome-build-accession GCA_000001735.1
#!genebuild-last-updated 2010-09
1 araport11 gene 3631 5899 . + . gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1 araport11 transcript 3631 5899 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1 araport11 exon 3631 3913 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";
1 araport11 CDS 3760 3913 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; protein_version "1";
1 araport11 start_codon 3760 3762 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
# how many genes?
gunzip -c Arabidopsis_thaliana.TAIR10.37.gtf.gz | grep -v "^#" | awk '$3 == "gene" {print}' | perl -nle 'if (/gene_id\s"(\w+)"/) { print $1 }' | sort -u | wc -l
33977
I would first like to check whether the coordinates are correct for araTha1. I'll download the cDNA sequences, pick a random entry (that only has one transcript), map it to araTha1 using blat, and compare the mapped coordinates to the corresponding coordinates in the GTF file.
wget -c ftp://ftp.ensemblgenomes.org/pub/release-37/plants/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
blat araTha1.fa AT3G55270.fa AT3G55270.psl
Loaded 119667750 letters in 7 sequences
Searched 3130 bases in 1 sequences
# clone my PSL to BED script
git clone https://gist.github.com/7314846.git
perl 7314846/psl_to_bed_best_score.pl AT3G55270.psl AT3G55270.bed
Stored 1 entries
cat AT3G55270.bed
chr3 20496080 20500019 AT3G55270 3127 + 20496080 20500019 255,0,0 4 179,2493,158,300, 0,498,3271,3639
gunzip -c Arabidopsis_thaliana.TAIR10.37.gtf.gz | grep AT3G55270 | awk '$3 == "gene" {print}'
3 araport11 gene 20496081 20500019 . + . gene_id "AT3G55270"; gene_name "MKP1"; gene_source "araport11"; gene_biotype "nontranslating_CDS";
The coordinates are the same; BED is 0-based while GTF is 1-based. Now, we just need to calculate the promoter coordinates by simply adding and subtracted 500 bp from the start of the gene. Note that when a gene is on the negative strand, the end coordinate is the start. Below is a short Perl script, which I have named get_promoter.pl, that creates a BED file with the promoter coordinates.
#!/usr/bin/env perl
use strict;
use warnings;
my $infile = "Arabidopsis_thaliana.TAIR10.37.gtf.gz";
my $upstream = 500;
my $downstream = 500;
open(IN, '-|', "gunzip -c $infile") || die "Could not open $infile: $!\n";
while(<IN>){
chomp;
next if /^#/;
my @s = split(/\t/);
next unless $s[2] eq "gene";
my $gene_id = '';
if ($s[8] =~ /^gene_id\s"(\w+)";/){
$gene_id = $1;
} else {
die "Could not obtain gene ID\n";
}
my $chr = "chr$s[0]";
if ($chr eq "chrPt"){
$chr = "chrCp";
}
if ($s[6] eq "+"){
my $left = $s[3]-$upstream;
my $right = $s[3]+$downstream;
if ($left < 0){
$left = 1;
}
print join("\t", $chr, $left, $right, $gene_id, 0, $s[6]), "\n"
} elsif ($s[6] eq "-"){
my $left = $s[4]-$downstream;
my $right = $s[4]+$upstream;;
if ($left < 0){
$left = 1;
}
print join("\t", $chr, $left, $right, $gene_id, 0, $s[6]), "\n"
} else {
die "Strand not specified\n";
}
}
close(IN);
Run the script and obtain the promoter sequences.
get_promoter.pl | gzip > Arabidopsis_thaliana.TAIR10.37.promoter.bed.gz
gunzip -c Arabidopsis_thaliana.TAIR10.37.promoter.bed.gz | head
chr1 3131 4131 AT1G01010 0 +
chr1 8630 9630 AT1G01020 0 -
chr1 10601 11601 AT1G03987 0 +
chr1 13214 14214 AT1G01030 0 -
chr1 22621 23621 AT1G01040 0 +
chr1 23599 24599 AT1G03993 0 -
chr1 28000 29000 AT1G01046 0 +
chr1 32671 33671 AT1G01050 0 -
chr1 32227 33227 AT1G03997 0 +
chr1 37371 38371 AT1G01060 0 -
gunzip -c Arabidopsis_thaliana.TAIR10.37.promoter.bed.gz |
awk '{print $1":"($2+1)"-"$3}' |
xargs samtools faidx araTha1.fa |
gzip > Arabidopsis_thaliana.TAIR10.37.promoter.fa.gz
Check out the FASTA file.
gunzip -c Arabidopsis_thaliana.TAIR10.37.promoter.fa.gz | head >chr1:3132-4131 CTCCTTTGTTTTGCTTTGGGAGGGACCCATTATTACCGCCCAGCAGCTTCCCAGCCTTCC TTTATAAGGCTTAATTTATATTTATTTAAATTTTATATGTTCTTCTATTATAATACTAAA AGGGGAATACAAATTTCTACAGAGGATGATATTCAATCCACGGTTCACCCAAACCGATTT TATAAAATTTATTATTAAATCTTTTTTAATTGTTAAATTGGTTTAAATCTGAACTCTGTT TACTTACATTGATTAAAATTCTAAACCATCATAAGTAAAAAATAATATGATTAAGACTAA TAAATCTTAATAGTTAATACTACTCGGTTTACTACATGAAATTTCATACCATCAATTGTT TTAATAATCTTTAAAATTGTTAGGACCGGTAAAACCATACCAATTAAACCGGAGATCCAT ATTAATTTAATTAAGAAAATAAAAATAAAAGGAATAAATTGTCTTATTTAAACGCTGACT TCACTGTCTTCCTCCCTCCAAATTATTAGATATACCAAACCAGAGAAAACAAATACATAA
Promoter sequence of AT1G01010 (red bar with slanted lines).
Promoter sequence of AT1G01020 (red bar with slanted lines).
Performing a gene ontology enrichment analysis
Using GOstats and all genes as the universe list; note the central ID used in org.At.tair.db are TAIR IDs instead of ENTREZ IDs.
library(org.At.tair.db)
library("GO.db")
library("GOstats")
# store all central IDs
all_gene <- keys(org.At.tair.db)
# get a random list of IDs
set.seed(31)
my_random_gene <- sample(all_gene, 100)
# set parameters for hypergeometric test
params <- new('GOHyperGParams',
geneIds=my_random_gene,
universeGeneIds=all_gene,
ontology='BP',
pvalueCutoff=0.001,
conditional=F,
testDirection='over',
annotation="org.At.tair.db"
)
my_over_test <- hyperGTest(params)
my_over_test
Gene to GO BP test for over-representation
653 GO BP ids tested (5 have p < 0.001)
Selected gene set size: 75
Gene universe size: 20901
Annotation package: org.At.tair
summary(my_over_test)
GOBPID Pvalue OddsRatio ExpCount Count Size Term
1 GO:0030641 3.802668e-05 570.54795 0.01076504 2 3 regulation of cellular pH
2 GO:0051453 3.802668e-05 570.54795 0.01076504 2 3 regulation of intracellular pH
3 GO:0006885 4.499871e-04 81.48337 0.03229511 2 9 regulation of pH
4 GO:0030004 4.499871e-04 81.48337 0.03229511 2 9 cellular monovalent inorganic cation homeostasis
5 GO:0006551 9.659419e-04 51.84309 0.04664849 2 13 leucine metabolic process
I should note that it is not straightforward on how one should correct the p-value, given the structure of gene ontology (GO) terms. The universe list should be limited to genes that were actually probed or assayed.
Percentage exonic, intronic, and intergenic
I have calculated the percent exonic, intronic, and intergenic in the Arabidopsis thaliana genome. Exonic regions make up 39.96%; intronic regions make up 15.17%; and intergenic regions make up 44.93% of the genome.
Other resources

This work is licensed under a Creative Commons
Attribution 4.0 International License.



Dear Dev Tang,
Saw your blog and am very impressed. I just started to wrok with Arabidopsis but not really on the informatic side. I want to start analysing a few mutants. I am trying to see where exactly is the T-DNA located (inserted to create knockout mutants). I see flanking sequence but am not sure where exactly is the flanking seq (I mean before or after the T-DNA insert). I would also like to know how big is the insert (when used pROK2 vector). Could you please help me with this?
There is no TAIR11. Araport11 doesn’t provide a new genome release.
Thank you for the comment. I’ve since amended my post.
Hi bro
I am using this tutorial https://github.com/CBC-UCONN/RNA-Seq-Model-Organism-Arabidopsis-thaliana, but I got stucked at the annotation part, first because I used the NCBI genome and annotation, and because the guys that made the tut didnt not put all the code or I was not capable to follow.
In this line I got a error
## to view first 20 filters in the list
head(listFilters(thale_mart),20)
Error in martCheck(mart) : object ‘thale_mart’ not found
Do you have any comment to help me ?
Thanks
In lines:
mart=useMart(“plants_mart”, host=”plants.ensembl.org”)
head(listDatasets(mart))[grep(“thaliana”,listDatasets(mart)[,1]),]
shuldn’t be
mart=useMart(“plants_mart”, host=”plants.ensembl.org”)
thale_mart = head(listDatasets(mart))[grep(“thaliana”,listDatasets(mart)[,1]),]
?
They are missing this line of code:
Hi Dave, do you know how to do the similar annotation and go analysis with petunia RNA-seq data? Not sure i can get the Org….db for petunia.
Thanks
Luke
Hi Luke,
sorry I have not worked with that organism. @PhilippBayer on Twitter may know; he’s quite busy but usually very responsive!
Cheers,
Dave
Hi Dave,
I was trying to use the perl script for getting a promoter BED file made, but I came across some issues. I do not know perl, so forgive me if I am incorrect, but the biggest issue is that it seem hardcoded to expect gene_id to be the first element in the 9th column, which is not true for the GTF that I am using.
Also it expects gene lines to be included in GTF, which I think Ensembl do, but most other GTFs do not (and the creator of gffread refuses to add gene lines to GTF, old github response I saw from them). I think that I manually fixed that issue (as well as issue of how chr is formatted in the 1st column).
Finally, it is not clear to me if this script corrects for the 0-index nature of BED vs the 1-index nature of GTF, can you confirm if it does or does not?
Many thanks,
James
Hi James,
you’re correct on all three comments. You can try the script at https://github.com/davetang/defining_genomic_regions/blob/master/script/gtf_to_bed.pl instead. Ideally, that script should ask for a dictionary of the genome so that it doesn’t create spanning regions that are longer than the actually chromosome. If you’re creating BED files with large paddings, please let me know and I’ll update that script to ask for a dict file.
Cheers,
Dave
Thanks for the quick response, Dave! I have written my own work-around for now in Python, but if that doesn’t work out, I will take a look at the linked script, many thanks!