Getting started with Arabidopsis thaliana genomics

I have started to work on Arabidopsis thaliana, as I mentioned in my last post. As noted in the Encyclopedia of life:

Arabidopsis thaliana is the most widely used model organism in plant biology. Its small genome size, fully sequenced in the year 2000, chromosome number, fast growth cycle (from seed germination to set in four to six weeks), small size (hundreds can be grown in a pot and thousands in a growth chamber), autogamous breeding system (induced mutations are expressed in two generations), and ability to grow on various synthetic media, all make the species an ideal system in experimental biology.

As indispensable tool for genomics is a genome browser and I have been a long time fan of the UCSC Genome Browser. To my initial dismay, the Arabidopsis genome was not available on the Genome Browser Gateway page. But later I found an Assembly Hub for plants, which included the TAIR10 genome assembly (the latest is TAIR11). Simply head over to the Track Data Hubs page, find "CSHL Biology of Genomes meeting 2013 demonstration assembly hub", and click connect.

Not as many available tracks as compared to the human genome browser page.

In addition, you can download the genome sequence and other useful files from the assembly hub.

# download the genome
wget http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/araTha1/araTha1.2bit

# download the chrom sizes, which is useful for BEDTools
wget http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/araTha1/araTha1.chrom.sizes

# convert to fasta
# download executables from http://genome-test.cse.ucsc.edu/~kent/exe/
twoBitToFa araTha1.2bit araTha1.fa

head araTha1.fa
>chr1
ccctaaaccctaaaccctaaaccctaaacctctgaatccttaatccctaa
atccctaaatctttaaatcctacatccatgaatccctaaatacctaattc
cctaaacCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATG
ATAATTTTATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTT
AAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTT
CCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAA
GCTTTGCTACGATCTACATTTGGGAATGTGAGTCTCTTATTGTAACCTTA
GGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTT
GGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAA

# random FASTA file
head random.fa 
>chr3:14195741-14196603
TCAAATTCCAAGTATTTCTTTTTTTTTGGCACCGGTGTCTCCTCAGACAT
TTCAATGTCTGTTGGTGCCAAGAGGGAAAAGGGCTATTAAGCTATATAGG
GGGGTGGGTGTTGAGGGAGTCTGGGCAGTCCGTGGGAACCCCCTTTATCG
GTTCGGACTTGGGTAGCGATCGAGGGATGGTATCGGATATCGACACGAGG
AATGACCGACCGTCCGGCCGCCGGGATTTTCGCCGGAAAACTTTTCCGGC
GACTTTTCCGGCGATCGGTTTTGTTGCCTTTTTCCGAGTTTTCTCAGCAG
TTCTCGGACAAAAACTGCTGAATCGTCGAGGAGAATGGGCTTGCCTTGCG
TGGGCTGCCATTAGTTCTTCGAGGCGTTAGGGTGGCGGCGGTATAAAAGT
GTCGGAGTTTTTTCAGCAGTTCTCGGACAAAAATTGCTGAGTGGCCGAGA

# blat random FASTA file
# see https://davetang.org/muse/2012/05/15/using-blat/
blat -minScore=0 -stepSize=1 araTha1.2bit random.fa random.psl
Loaded 119667750 letters in 7 sequences
Searched 863 bases in 1 sequences

# clone my PSL to BED script
git clone https://gist.github.com/7314846.git
perl 7314846/psl_to_bed_best_score.pl random.psl random.bed
cat random.bed 
chr3    14195740        14196603        chr3:14195741-14196603  863     +       14195740        14196603        255,0,0 1       863,    0

Bioconductor

Bioconductor provides useful annotation packages for various organisms, including Arabidopsis thaliana. The annotation packages are quite easy to use; the select(), columns(), and keys() functions are used together to extract data from an AnnotationDb object. The org.At.tair.db package provides genome wide annotation for Arabidopsis.

# install if necessary
source("https://bioconductor.org/biocLite.R")
biocLite("org.At.tair.db")

# load library
library(org.At.tair.db)

# The name of an org package is always of the form org.Ab.id.db
# where Ab is a 2-letter abbreviation of the organism and
# id is an abbreviation (in lower-case) describing the type of central identifier
class(org.At.tair.db)
[1] "OrgDb"
attr(,"package")
[1] "AnnotationDbi"

# columns shows which kinds of data can be returned for the AnnotationDb object
columns(org.At.tair.db)
 [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
 [8] "GO"           "GOALL"        "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"      
[15] "SYMBOL"       "TAIR"

# map ENTREZIDs to gene symbols and TAIR IDs
# store six Entrez IDs
my_key <- head(keys(org.At.tair.db, keytype="ENTREZID"))

head(my_key)
[1] "839580" "839569" "839321" "839574" "839579" "839341"

# columns to return
my_col <- c('SYMBOL', 'TAIR')

# query!
select(org.At.tair.db, keys=my_key, columns=my_col, keytype="ENTREZID")
'select()' returned 1:many mapping between keys and columns
   ENTREZID  SYMBOL      TAIR
1    839580 ANAC001 AT1G01010
2    839580  NAC001 AT1G01010
3    839569    ARV1 AT1G01020
4    839321    NGA3 AT1G01030
5    839574    ASU1 AT1G01040
6    839574  ATDCL1 AT1G01040
7    839574     CAF AT1G01040
8    839574    DCL1 AT1G01040
9    839574   EMB60 AT1G01040
10   839574   EMB76 AT1G01040
11   839574    SIN1 AT1G01040
12   839574    SUS1 AT1G01040
13   839579  AtPPa1 AT1G01050
14   839579    PPa1 AT1G01050
15   839341     LHY AT1G01060
16   839341    LHY1 AT1G01060

# number of TAIR IDs
length(keys(org.At.tair.db, keytype = 'TAIR'))
[1] 27416

For example, the annotation package makes it easy to find potential transcription factors; I just need to find genes annotated with the gene ontology (GO) accession "GO:0003700", which is for transcription factor activity, sequence-specific DNA binding.

my_go <- 'GO:0003700'
my_go_tf <- select(org.At.tair.db, keys=my_go, columns=my_col, keytype="GO")
'select()' returned 1:many mapping between keys and columns

head(my_go_tf)
          GO EVIDENCE ONTOLOGY  SYMBOL      TAIR
1 GO:0003700      ISS       MF ANAC001 AT1G01010
2 GO:0003700      ISS       MF  NAC001 AT1G01010
3 GO:0003700      ISS       MF    NGA3 AT1G01030
4 GO:0003700      IDA       MF     LHY AT1G01060
5 GO:0003700      IDA       MF    LHY1 AT1G01060
6 GO:0003700      ISS       MF     LHY AT1G01060

# number of potential transcription factors
length(unique(my_go_tf$TAIR))
[1] 1669

I can use the GO annotation package to find gene ontologies associated with a gene of interest.

my_gene <- 'AT5G60130'
my_gene_go <- select(org.At.tair.db, keys=my_gene, columns='GO', keytype="TAIR")
'select()' returned 1:many mapping between keys and columns

my_gene_go
       TAIR         GO EVIDENCE ONTOLOGY
1 AT5G60130 GO:0003677      ISS       MF
2 AT5G60130 GO:0003700      ISS       MF
3 AT5G60130 GO:0005634      ISM       CC
4 AT5G60130 GO:0006355      ISS       BP

# install if necessary
source("https://bioconductor.org/biocLite.R")
biocLite("GO.db")
library(GO.db)

Term(my_gene_go$GO)
                                                    GO:0003677 
                                                 "DNA binding" 
                                                    GO:0003700 
"transcription factor activity, sequence-specific DNA binding" 
                                                    GO:0005634 
                                                     "nucleus" 
                                                    GO:0006355 
                  "regulation of transcription, DNA-templated"

The TxDb.Athaliana.BioMart.plantsmart28 annotation package provides coordinates for Arabidopsis thaliana genes and transcripts.

# install if necessary
source("https://bioconductor.org/biocLite.R")
biocLite("TxDb.Athaliana.BioMart.plantsmart28")
library(TxDb.Athaliana.BioMart.plantsmart28)

columns(TxDb.Athaliana.BioMart.plantsmart28)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"   
 [9] "EXONID"     "EXONNAME"   "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"     
[17] "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"   "TXTYPE"

all_gene <- genes(TxDb.Athaliana.BioMart.plantsmart28)
all_gene
GRanges object with 33602 ranges and 1 metadata column:
            seqnames           ranges strand |     gene_id
               <Rle>        <IRanges>  <Rle> | <character>
  AT1G01010        1   [ 3631,  5899]      + |   AT1G01010
  AT1G01020        1   [ 5928,  8737]      - |   AT1G01020
  AT1G01030        1   [11649, 13714]      - |   AT1G01030
  AT1G01040        1   [23146, 31227]      + |   AT1G01040
  AT1G01046        1   [28500, 28706]      + |   AT1G01046
        ...      ...              ...    ... .         ...
  ATMG01370       Mt [360717, 361052]      - |   ATMG01370
  ATMG01380       Mt [361062, 361179]      - |   ATMG01380
  ATMG01390       Mt [361350, 363284]      - |   ATMG01390
  ATMG01400       Mt [363725, 364042]      + |   ATMG01400
  ATMG01410       Mt [366086, 366700]      - |   ATMG01410
  -------
  seqinfo: 7 sequences (1 circular) from an unspecified genome

# See GRanges vignette for a nice introduction
# http://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.pdf
seqnames(all_gene)
factor-Rle of length 33602 with 7 runs
  Lengths: 8433 5513 6730 5140 7507  133  146
  Values :    1    2    3    4    5   Pt   Mt
Levels(7): 1 2 3 4 5 Mt Pt

# find location of gene
my_gene <- 'AT5G60130'
all_gene[names(all_gene)==my_gene]
GRanges object with 1 range and 1 metadata column:
            seqnames               ranges strand |     gene_id
               <Rle>            <IRanges>  <Rle> | <character>
  AT5G60130        5 [24211871, 24213164]      - |   AT5G60130
  -------
  seqinfo: 7 sequences (1 circular) from an unspecified genome

# find transcripts for gene
my_col <- columns(TxDb.Athaliana.BioMart.plantsmart28)
select(TxDb.Athaliana.BioMart.plantsmart28,
       keys = my_gene,
       columns = my_col,
       keytype = 'GENEID')
'select()' returned 1:many mapping between keys and columns
     GENEID  CDSID CDSNAME CDSCHROM CDSSTRAND CDSSTART   CDSEND EXONID          EXONNAME EXONCHROM EXONSTRAND
1 AT5G60130 145033    <NA>        5         - 24212956 24212958 168214 AT5G60130.1.exon1         5          -
2 AT5G60130 145031    <NA>        5         - 24212216 24212854 168212 AT5G60130.1.exon2         5          -
3 AT5G60130 145029    <NA>        5         - 24211871 24212131 168210 AT5G60130.2.exon5         5          -
4 AT5G60130 145035    <NA>        5         - 24213103 24213112 168216 AT5G60130.2.exon1         5          -
5 AT5G60130 145034    <NA>        5         - 24212956 24213026 168215 AT5G60130.2.exon2         5          -
6 AT5G60130 145032    <NA>        5         - 24212390 24212854 168213 AT5G60130.2.exon3         5          -
7 AT5G60130 145030    <NA>        5         - 24212216 24212311 168211 AT5G60130.2.exon4         5          -
8 AT5G60130 145029    <NA>        5         - 24211871 24212131 168210 AT5G60130.2.exon5         5          -
  EXONSTART  EXONEND  TXID EXONRANK      TXNAME         TXTYPE TXCHROM TXSTRAND  TXSTART    TXEND
1  24212956 24212958 40825        1 AT5G60130.1 protein_coding       5        - 24211871 24212958
2  24212216 24212854 40825        2 AT5G60130.1 protein_coding       5        - 24211871 24212958
3  24211871 24212131 40825        3 AT5G60130.1 protein_coding       5        - 24211871 24212958
4  24213103 24213164 40826        1 AT5G60130.2 protein_coding       5        - 24211871 24213164
5  24212956 24213026 40826        2 AT5G60130.2 protein_coding       5        - 24211871 24213164
6  24212390 24212854 40826        3 AT5G60130.2 protein_coding       5        - 24211871 24213164
7  24212216 24212311 40826        4 AT5G60130.2 protein_coding       5        - 24211871 24213164
8  24211871 24212131 40826        5 AT5G60130.2 protein_coding       5        - 24211871 24213164

Promoter sequence

I will use the Arabidopsis thaliana genome sequence provided on the UCSC Genome Browser test site (see above); note that araTha1 corresponds to TAIR10. We will use SAMtools to extract the promoter sequence from the genome; first we need to index the FASTA file.

samtools faidx araTha1.fa

# note the 1 base start
head -2 araTha1.fa
>chr1
ccctaaaccctaaaccctaaaccctaaacctctgaatccttaatccctaa

samtools faidx araTha1.fa chr1:0-11
>chr1:0-11
ccctaaaccct

samtools faidx araTha1.fa chr1:1-11
>chr1:1-11
ccctaaaccct

Now we just need to provide the coordinates of promoter sequences to SAMtools. As far as I'm aware, there's no clear boundary as to where a promoter starts and ends; I've seen people use -300/+100 bp or -1000/+1000. For this post, I'll use -500/+500 and you can modify it as you please. However, we need the coordinates of the genes. I will use the GTF file provided by Ensembl Plants.

wget -c ftp://ftp.ensemblgenomes.org/pub/release-37/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.37.gtf.gz

gunzip -c Arabidopsis_thaliana.TAIR10.37.gtf.gz | head
#!genome-build TAIR10
#!genome-version TAIR10
#!genome-date 2010-09
#!genome-build-accession GCA_000001735.1
#!genebuild-last-updated 2010-09
1       araport11       gene    3631    5899    .       +       .       gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1       araport11       transcript      3631    5899    .       +       .       gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1       araport11       exon    3631    3913    .       +       .       gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";
1       araport11       CDS     3760    3913    .       +       0       gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; protein_version "1";
1       araport11       start_codon     3760    3762    .       +       0       gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

# how many genes?
gunzip -c Arabidopsis_thaliana.TAIR10.37.gtf.gz | grep -v "^#" | awk '$3 == "gene" {print}' | perl -nle 'if (/gene_id\s"(\w+)"/) { print $1 }' | sort -u | wc -l
   33977

I would first like to check whether the coordinates are correct for araTha1. I'll download the cDNA sequences, pick a random entry (that only has one transcript), map it to araTha1 using blat, and compare the mapped coordinates to the corresponding coordinates in the GTF file.

wget -c ftp://ftp.ensemblgenomes.org/pub/release-37/plants/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz

blat araTha1.fa AT3G55270.fa AT3G55270.psl
Loaded 119667750 letters in 7 sequences
Searched 3130 bases in 1 sequences

# clone my PSL to BED script
git clone https://gist.github.com/7314846.git
perl 7314846/psl_to_bed_best_score.pl AT3G55270.psl AT3G55270.bed
Stored 1 entries

cat AT3G55270.bed
chr3    20496080        20500019        AT3G55270       3127    +       20496080        20500019        255,0,0 4       179,2493,158,300,       0,498,3271,3639

gunzip -c Arabidopsis_thaliana.TAIR10.37.gtf.gz | grep AT3G55270 | awk '$3 == "gene" {print}'
3       araport11       gene    20496081        20500019        .       +       .       gene_id "AT3G55270"; gene_name "MKP1"; gene_source "araport11"; gene_biotype "nontranslating_CDS";

The coordinates are the same; BED is 0-based while GTF is 1-based. Now, we just need to calculate the promoter coordinates by simply adding and subtracted 500 bp from the start of the gene. Note that when a gene is on the negative strand, the end coordinate is the start. Below is a short Perl script, which I have named get_promoter.pl, that creates a BED file with the promoter coordinates.

#!/usr/bin/env perl

use strict;
use warnings;

my $infile = "Arabidopsis_thaliana.TAIR10.37.gtf.gz";
my $upstream = 500;
my $downstream = 500;

open(IN, '-|', "gunzip -c $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   next if /^#/;

   my @s = split(/\t/);
   next unless $s[2] eq "gene";

   my $gene_id = '';
   if ($s[8] =~ /^gene_id\s"(\w+)";/){
      $gene_id = $1;
   } else {
      die "Could not obtain gene ID\n";
   }

   my $chr = "chr$s[0]";
   if ($chr eq "chrPt"){
      $chr = "chrCp";
   }

   if ($s[6] eq "+"){
      my $left = $s[3]-$upstream;
      my $right = $s[3]+$downstream;
      if ($left < 0){
         $left = 1;
      }
      print join("\t", $chr, $left, $right, $gene_id, 0, $s[6]), "\n"
   } elsif ($s[6] eq "-"){
      my $left = $s[4]-$downstream;
      my $right = $s[4]+$upstream;;
      if ($left < 0){
         $left = 1;
      }
      print join("\t", $chr, $left, $right, $gene_id, 0, $s[6]), "\n"
   } else {
      die "Strand not specified\n";
   }

}
close(IN);

Run the script and obtain the promoter sequences.

get_promoter.pl | gzip > Arabidopsis_thaliana.TAIR10.37.promoter.bed.gz

gunzip -c Arabidopsis_thaliana.TAIR10.37.promoter.bed.gz | head
chr1    3131    4131    AT1G01010       0       +
chr1    8630    9630    AT1G01020       0       -
chr1    10601   11601   AT1G03987       0       +
chr1    13214   14214   AT1G01030       0       -
chr1    22621   23621   AT1G01040       0       +
chr1    23599   24599   AT1G03993       0       -
chr1    28000   29000   AT1G01046       0       +
chr1    32671   33671   AT1G01050       0       -
chr1    32227   33227   AT1G03997       0       +
chr1    37371   38371   AT1G01060       0       -

gunzip -c Arabidopsis_thaliana.TAIR10.37.promoter.bed.gz |
awk '{print $1":"($2+1)"-"$3}' |
xargs samtools faidx araTha1.fa |
gzip > Arabidopsis_thaliana.TAIR10.37.promoter.fa.gz

Check out the FASTA file.

gunzip -c Arabidopsis_thaliana.TAIR10.37.promoter.fa.gz  | head
>chr1:3132-4131
CTCCTTTGTTTTGCTTTGGGAGGGACCCATTATTACCGCCCAGCAGCTTCCCAGCCTTCC
TTTATAAGGCTTAATTTATATTTATTTAAATTTTATATGTTCTTCTATTATAATACTAAA
AGGGGAATACAAATTTCTACAGAGGATGATATTCAATCCACGGTTCACCCAAACCGATTT
TATAAAATTTATTATTAAATCTTTTTTAATTGTTAAATTGGTTTAAATCTGAACTCTGTT
TACTTACATTGATTAAAATTCTAAACCATCATAAGTAAAAAATAATATGATTAAGACTAA
TAAATCTTAATAGTTAATACTACTCGGTTTACTACATGAAATTTCATACCATCAATTGTT
TTAATAATCTTTAAAATTGTTAGGACCGGTAAAACCATACCAATTAAACCGGAGATCCAT
ATTAATTTAATTAAGAAAATAAAAATAAAAGGAATAAATTGTCTTATTTAAACGCTGACT
TCACTGTCTTCCTCCCTCCAAATTATTAGATATACCAAACCAGAGAAAACAAATACATAA

Promoter sequence of AT1G01010 (red bar with slanted lines).

Promoter sequence of AT1G01020 (red bar with slanted lines).

Performing a gene ontology enrichment analysis

Using GOstats and all genes as the universe list; note the central ID used in org.At.tair.db are TAIR IDs instead of ENTREZ IDs.

library(org.At.tair.db)
library("GO.db")
library("GOstats")

# store all central IDs
all_gene <- keys(org.At.tair.db)

# get a random list of IDs
set.seed(31)
my_random_gene <- sample(all_gene, 100)

# set parameters for hypergeometric test
params <- new('GOHyperGParams',
              geneIds=my_random_gene,
              universeGeneIds=all_gene,
              ontology='BP',
              pvalueCutoff=0.001,
              conditional=F,
              testDirection='over',
              annotation="org.At.tair.db"
)

my_over_test <- hyperGTest(params)

my_over_test
Gene to GO BP  test for over-representation 
653 GO BP ids tested (5 have p < 0.001)
Selected gene set size: 75 
    Gene universe size: 20901 
    Annotation package: org.At.tair

summary(my_over_test)
      GOBPID       Pvalue OddsRatio   ExpCount Count Size                                             Term
1 GO:0030641 3.802668e-05 570.54795 0.01076504     2    3                        regulation of cellular pH
2 GO:0051453 3.802668e-05 570.54795 0.01076504     2    3                   regulation of intracellular pH
3 GO:0006885 4.499871e-04  81.48337 0.03229511     2    9                                 regulation of pH
4 GO:0030004 4.499871e-04  81.48337 0.03229511     2    9 cellular monovalent inorganic cation homeostasis
5 GO:0006551 9.659419e-04  51.84309 0.04664849     2   13                        leucine metabolic process

I should note that it is not straightforward on how one should correct the p-value, given the structure of gene ontology (GO) terms. The universe list should be limited to genes that were actually probed or assayed.

Percentage exonic, intronic, and intergenic

I have calculated the percent exonic, intronic, and intergenic in the Arabidopsis thaliana genome. Exonic regions make up 39.96%; intronic regions make up 15.17%; and intergenic regions make up 44.93% of the genome.

Other resources

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *