Getting started with Arabidopsis thaliana genomics

I have started to work on Arabidopsis thaliana, as I mentioned in my last post. As noted in the Encyclopedia of life:

Arabidopsis thaliana is the most widely used model organism in plant biology. Its small genome size, fully sequenced in the year 2000, chromosome number, fast growth cycle (from seed germination to set in four to six weeks), small size (hundreds can be grown in a pot and thousands in a growth chamber), autogamous breeding system (induced mutations are expressed in two generations), and ability to grow on various synthetic media, all make the species an ideal system in experimental biology.

As indispensable tool for genomics is a genome browser and I have been a long time fan of the UCSC Genome Browser. To my initial dismay, the Arabidopsis genome was not available on the Genome Browser Gateway page. But later I found an Assembly Hub for plants, which included the TAIR10 genome assembly (the latest is TAIR11). Simply head over to the Track Data Hubs page, find "CSHL Biology of Genomes meeting 2013 demonstration assembly hub", and click connect.

Not as many available tracks as compared to the human genome browser page.

In addition, you can download the genome sequence and other useful files from the assembly hub.

# download the genome
wget http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/araTha1/araTha1.2bit

# download the chrom sizes, which is useful for BEDTools
wget http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/araTha1/araTha1.chrom.sizes

# convert to fasta
# download executables from http://genome-test.cse.ucsc.edu/~kent/exe/
twoBitToFa araTha1.2bit araTha1.fa

head araTha1.fa
>chr1
ccctaaaccctaaaccctaaaccctaaacctctgaatccttaatccctaa
atccctaaatctttaaatcctacatccatgaatccctaaatacctaattc
cctaaacCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATG
ATAATTTTATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTT
AAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTT
CCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAA
GCTTTGCTACGATCTACATTTGGGAATGTGAGTCTCTTATTGTAACCTTA
GGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTT
GGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAA

# random FASTA file
head random.fa 
>chr3:14195741-14196603
TCAAATTCCAAGTATTTCTTTTTTTTTGGCACCGGTGTCTCCTCAGACAT
TTCAATGTCTGTTGGTGCCAAGAGGGAAAAGGGCTATTAAGCTATATAGG
GGGGTGGGTGTTGAGGGAGTCTGGGCAGTCCGTGGGAACCCCCTTTATCG
GTTCGGACTTGGGTAGCGATCGAGGGATGGTATCGGATATCGACACGAGG
AATGACCGACCGTCCGGCCGCCGGGATTTTCGCCGGAAAACTTTTCCGGC
GACTTTTCCGGCGATCGGTTTTGTTGCCTTTTTCCGAGTTTTCTCAGCAG
TTCTCGGACAAAAACTGCTGAATCGTCGAGGAGAATGGGCTTGCCTTGCG
TGGGCTGCCATTAGTTCTTCGAGGCGTTAGGGTGGCGGCGGTATAAAAGT
GTCGGAGTTTTTTCAGCAGTTCTCGGACAAAAATTGCTGAGTGGCCGAGA

# blat random FASTA file
# see https://davetang.org/muse/2012/05/15/using-blat/
blat -minScore=0 -stepSize=1 araTha1.2bit random.fa random.psl
Loaded 119667750 letters in 7 sequences
Searched 863 bases in 1 sequences

# clone my PSL to BED script
git clone https://gist.github.com/7314846.git
perl 7314846/psl_to_bed_best_score.pl random.psl random.bed
cat random.bed 
chr3    14195740        14196603        chr3:14195741-14196603  863     +       14195740        14196603        255,0,0 1       863,    0

Bioconductor

Bioconductor provides useful annotation packages for various organisms, including Arabidopsis thaliana. The annotation packages are quite easy to use; the select(), columns(), and keys() functions are used together to extract data from an AnnotationDb object. The org.At.tair.db package provides genome wide annotation for Arabidopsis.

# install if necessary
source("https://bioconductor.org/biocLite.R")
biocLite("org.At.tair.db")

# load library
library(org.At.tair.db)

# The name of an org package is always of the form org.Ab.id.db
# where Ab is a 2-letter abbreviation of the organism and
# id is an abbreviation (in lower-case) describing the type of central identifier
class(org.At.tair.db)
[1] "OrgDb"
attr(,"package")
[1] "AnnotationDbi"

# columns shows which kinds of data can be returned for the AnnotationDb object
columns(org.At.tair.db)
 [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
 [8] "GO"           "GOALL"        "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"      
[15] "SYMBOL"       "TAIR"

# map ENTREZIDs to gene symbols and TAIR IDs
# store six Entrez IDs
my_key <- head(keys(org.At.tair.db, keytype="ENTREZID"))

head(my_key)
[1] "839580" "839569" "839321" "839574" "839579" "839341"

# columns to return
my_col <- c('SYMBOL', 'TAIR')

# query!
select(org.At.tair.db, keys=my_key, columns=my_col, keytype="ENTREZID")
'select()' returned 1:many mapping between keys and columns
   ENTREZID  SYMBOL      TAIR
1    839580 ANAC001 AT1G01010
2    839580  NAC001 AT1G01010
3    839569    ARV1 AT1G01020
4    839321    NGA3 AT1G01030
5    839574    ASU1 AT1G01040
6    839574  ATDCL1 AT1G01040
7    839574     CAF AT1G01040
8    839574    DCL1 AT1G01040
9    839574   EMB60 AT1G01040
10   839574   EMB76 AT1G01040
11   839574    SIN1 AT1G01040
12   839574    SUS1 AT1G01040
13   839579  AtPPa1 AT1G01050
14   839579    PPa1 AT1G01050
15   839341     LHY AT1G01060
16   839341    LHY1 AT1G01060

# number of TAIR IDs
length(keys(org.At.tair.db, keytype = 'TAIR'))
[1] 27416

For example, the annotation package makes it easy to find potential transcription factors; I just need to find genes annotated with the gene ontology (GO) accession "GO:0003700", which is for transcription factor activity, sequence-specific DNA binding.

my_go <- 'GO:0003700'
my_go_tf <- select(org.At.tair.db, keys=my_go, columns=my_col, keytype="GO")
'select()' returned 1:many mapping between keys and columns

head(my_go_tf)
          GO EVIDENCE ONTOLOGY  SYMBOL      TAIR
1 GO:0003700      ISS       MF ANAC001 AT1G01010
2 GO:0003700      ISS       MF  NAC001 AT1G01010
3 GO:0003700      ISS       MF    NGA3 AT1G01030
4 GO:0003700      IDA       MF     LHY AT1G01060
5 GO:0003700      IDA       MF    LHY1 AT1G01060
6 GO:0003700      ISS       MF     LHY AT1G01060

# number of potential transcription factors
length(unique(my_go_tf$TAIR))
[1] 1669

I can use the GO annotation package to find gene ontologies associated with a gene of interest.

my_gene <- 'AT5G60130'
my_gene_go <- select(org.At.tair.db, keys=my_gene, columns='GO', keytype="TAIR")
'select()' returned 1:many mapping between keys and columns

my_gene_go
       TAIR         GO EVIDENCE ONTOLOGY
1 AT5G60130 GO:0003677      ISS       MF
2 AT5G60130 GO:0003700      ISS       MF
3 AT5G60130 GO:0005634      ISM       CC
4 AT5G60130 GO:0006355      ISS       BP

# install if necessary
source("https://bioconductor.org/biocLite.R")
biocLite("GO.db")
library(GO.db)

Term(my_gene_go$GO)
                                                    GO:0003677 
                                                 "DNA binding" 
                                                    GO:0003700 
"transcription factor activity, sequence-specific DNA binding" 
                                                    GO:0005634 
                                                     "nucleus" 
                                                    GO:0006355 
                  "regulation of transcription, DNA-templated"

The TxDb.Athaliana.BioMart.plantsmart28 annotation package provides coordinates for Arabidopsis thaliana genes and transcripts.

# install if necessary
source("https://bioconductor.org/biocLite.R")
biocLite("TxDb.Athaliana.BioMart.plantsmart28")
library(TxDb.Athaliana.BioMart.plantsmart28)

columns(TxDb.Athaliana.BioMart.plantsmart28)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"   
 [9] "EXONID"     "EXONNAME"   "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"     
[17] "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"   "TXTYPE"

all_gene <- genes(TxDb.Athaliana.BioMart.plantsmart28)
all_gene
GRanges object with 33602 ranges and 1 metadata column:
            seqnames           ranges strand |     gene_id
               <Rle>        <IRanges>  <Rle> | <character>
  AT1G01010        1   [ 3631,  5899]      + |   AT1G01010
  AT1G01020        1   [ 5928,  8737]      - |   AT1G01020
  AT1G01030        1   [11649, 13714]      - |   AT1G01030
  AT1G01040        1   [23146, 31227]      + |   AT1G01040
  AT1G01046        1   [28500, 28706]      + |   AT1G01046
        ...      ...              ...    ... .         ...
  ATMG01370       Mt [360717, 361052]      - |   ATMG01370
  ATMG01380       Mt [361062, 361179]      - |   ATMG01380
  ATMG01390       Mt [361350, 363284]      - |   ATMG01390
  ATMG01400       Mt [363725, 364042]      + |   ATMG01400
  ATMG01410       Mt [366086, 366700]      - |   ATMG01410
  -------
  seqinfo: 7 sequences (1 circular) from an unspecified genome

# See GRanges vignette for a nice introduction
# http://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.pdf
seqnames(all_gene)
factor-Rle of length 33602 with 7 runs
  Lengths: 8433 5513 6730 5140 7507  133  146
  Values :    1    2    3    4    5   Pt   Mt
Levels(7): 1 2 3 4 5 Mt Pt

# find location of gene
my_gene <- 'AT5G60130'
all_gene[names(all_gene)==my_gene]
GRanges object with 1 range and 1 metadata column:
            seqnames               ranges strand |     gene_id
               <Rle>            <IRanges>  <Rle> | <character>
  AT5G60130        5 [24211871, 24213164]      - |   AT5G60130
  -------
  seqinfo: 7 sequences (1 circular) from an unspecified genome

# find transcripts for gene
my_col <- columns(TxDb.Athaliana.BioMart.plantsmart28)
select(TxDb.Athaliana.BioMart.plantsmart28,
       keys = my_gene,
       columns = my_col,
       keytype = 'GENEID')
'select()' returned 1:many mapping between keys and columns
     GENEID  CDSID CDSNAME CDSCHROM CDSSTRAND CDSSTART   CDSEND EXONID          EXONNAME EXONCHROM EXONSTRAND
1 AT5G60130 145033    <NA>        5         - 24212956 24212958 168214 AT5G60130.1.exon1         5          -
2 AT5G60130 145031    <NA>        5         - 24212216 24212854 168212 AT5G60130.1.exon2         5          -
3 AT5G60130 145029    <NA>        5         - 24211871 24212131 168210 AT5G60130.2.exon5         5          -
4 AT5G60130 145035    <NA>        5         - 24213103 24213112 168216 AT5G60130.2.exon1         5          -
5 AT5G60130 145034    <NA>        5         - 24212956 24213026 168215 AT5G60130.2.exon2         5          -
6 AT5G60130 145032    <NA>        5         - 24212390 24212854 168213 AT5G60130.2.exon3         5          -
7 AT5G60130 145030    <NA>        5         - 24212216 24212311 168211 AT5G60130.2.exon4         5          -
8 AT5G60130 145029    <NA>        5         - 24211871 24212131 168210 AT5G60130.2.exon5         5          -
  EXONSTART  EXONEND  TXID EXONRANK      TXNAME         TXTYPE TXCHROM TXSTRAND  TXSTART    TXEND
1  24212956 24212958 40825        1 AT5G60130.1 protein_coding       5        - 24211871 24212958
2  24212216 24212854 40825        2 AT5G60130.1 protein_coding       5        - 24211871 24212958
3  24211871 24212131 40825        3 AT5G60130.1 protein_coding       5        - 24211871 24212958
4  24213103 24213164 40826        1 AT5G60130.2 protein_coding       5        - 24211871 24213164
5  24212956 24213026 40826        2 AT5G60130.2 protein_coding       5        - 24211871 24213164
6  24212390 24212854 40826        3 AT5G60130.2 protein_coding       5        - 24211871 24213164
7  24212216 24212311 40826        4 AT5G60130.2 protein_coding       5        - 24211871 24213164
8  24211871 24212131 40826        5 AT5G60130.2 protein_coding       5        - 24211871 24213164

Performing a gene ontology enrichment analysis

Using GOstats and all genes as the universe list; note the central ID used in org.At.tair.db are TAIR IDs instead of ENTREZ IDs.

library(org.At.tair.db)
library("GO.db")
library("GOstats")

# store all central IDs
all_gene <- keys(org.At.tair.db)

# get a random list of IDs
set.seed(31)
my_random_gene <- sample(all_gene, 100)

# set parameters for hypergeometric test
params <- new('GOHyperGParams',
              geneIds=my_random_gene,
              universeGeneIds=all_gene,
              ontology='BP',
              pvalueCutoff=0.001,
              conditional=F,
              testDirection='over',
              annotation="org.At.tair.db"
)

my_over_test <- hyperGTest(params)

my_over_test
Gene to GO BP  test for over-representation 
653 GO BP ids tested (5 have p < 0.001)
Selected gene set size: 75 
    Gene universe size: 20901 
    Annotation package: org.At.tair

summary(my_over_test)
      GOBPID       Pvalue OddsRatio   ExpCount Count Size                                             Term
1 GO:0030641 3.802668e-05 570.54795 0.01076504     2    3                        regulation of cellular pH
2 GO:0051453 3.802668e-05 570.54795 0.01076504     2    3                   regulation of intracellular pH
3 GO:0006885 4.499871e-04  81.48337 0.03229511     2    9                                 regulation of pH
4 GO:0030004 4.499871e-04  81.48337 0.03229511     2    9 cellular monovalent inorganic cation homeostasis
5 GO:0006551 9.659419e-04  51.84309 0.04664849     2   13                        leucine metabolic process

I should note that it is not straightforward on how one should correct the p-value, given the structure of gene ontology (GO) terms. The universe list should be limited to genes that were actually probed or assayed.

Percentage exonic, intronic, and intergenic

I have calculated the percent exonic, intronic, and intergenic in the Arabidopsis thaliana genome. Exonic regions make up 39.96%; intronic regions make up 15.17%; and intergenic regions make up 44.93% of the genome.

Other resources

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *