Gene Ontology enrichment analysis

Updated: 2013 November 23rd

There are many tools available for conducting a gene ontology enrichment analysis. Some online tools that I've heard of are DAVID, PANTHER and GOrilla but I have never actually used them. Within Bioconductor there's GOstats, topGO and goseq (and probably some more). Here I'll test out GOstats and topGO.

Firstly download and install the packages if you haven't already:

source("http://bioconductor.org/biocLite.R")
#A set of annotation maps describing the entire Gene Ontology
biocLite("GO.db")
biocLite("topGO")
biocLite("GOstats")

Now let's create a toy example where my test set of genes are those that are associated with the GO term GO:0007411 (axon guidance). Let's fetch all these genes:

#install if you haven't already
source("http://bioconductor.org/biocLite.R")
biocLite("org.Hs.eg.db")
biocLite("GO.db")
library(org.Hs.eg.db)
library(GO.db)

#for reference's sake
#how to get GO terms from an Entrez gene id following the manual
#http://bioconductor.org/packages/release/data/annotation/manuals/org.Hs.eg.db/man/org.Hs.eg.db.pdf

#org.Hs.egGO is an R object that provides
#mappings between entrez gene identifers and the GO
#identifers that they are directly associated with

entrez_object <- org.Hs.egGO
# Get the entrez gene identifiers that are mapped to a GO ID
mapped_genes <- mappedkeys(entrez_object)
# Convert to a list
entrez_to_go <- as.list(entrez_object[mapped_genes])
#http://www.ncbi.nlm.nih.gov/gene/?term=1
#entrez gene id 1 is A1BG
entrez_to_go[[1]]
$`GO:0008150`
$`GO:0008150`$GOID
[1] "GO:0008150"

$`GO:0008150`$Evidence
[1] "ND"

$`GO:0008150`$Ontology
[1] "BP"


$`GO:0005576`
$`GO:0005576`$GOID
[1] "GO:0005576"

$`GO:0005576`$Evidence
[1] "IDA"

$`GO:0005576`$Ontology
[1] "CC"


$`GO:0003674`
$`GO:0003674`$GOID
[1] "GO:0003674"

$`GO:0003674`$Evidence
[1] "ND"

$`GO:0003674`$Ontology
[1] "MF"

#map GO terms to Entrez gene ids

go_object <- as.list(org.Hs.egGO2EG)
#test
go_object[1]
$`GO:0000002`
    TAS     TAS     ISS     IMP     NAS     IMP     IMP
  "291"  "1890"  "4205"  "4358"  "9361" "10000" "92667"
#same
go_object['GO:0000002']
$`GO:0000002`
    TAS     TAS     ISS     IMP     NAS     IMP     IMP
  "291"  "1890"  "4205"  "4358"  "9361" "10000" "92667"

#GO:0000002 -> mitochondrial genome maintenance
#entrez gene id 291 -> SLC25A4
#which if you look up has the GO term mitochondrial genome maintenance

#now let's get all the entrez gene ids for
#GO:0007411 (axon guidance)

axon_gene <- go_object['GO:0007411']
length(unlist(axon_gene, use.names=F))
[1] 330

length(unique(unlist(axon_gene, use.names=F)))
[1] 319
axon_gene <- unique(unlist(axon_gene, use.names=F))
head(axon_gene)
[1] "25"  "27"  "60"  "71"  "160" "161"

Now let's perform the enrichment analysis:

library("GO.db")
library("GOstats")

#as the universal list, I will use all the genes with GO terms

universe <- mapped_genes
length(axon_gene)
[1] 319
length(universe)
[1] 18105

params <- new('GOHyperGParams',
              geneIds=axon_gene,
              universeGeneIds=universe,
              ontology='BP',
              pvalueCutoff=0.001,
              conditional=F,
              testDirection='over',
              annotation="org.Hs.eg.db"
             )
hgOver <- hyperGTest(params)

hgOver
Gene to GO BP  test for over-representation 
4207 GO BP ids tested (997 have p < 0.001)
Selected gene set size: 319 
    Gene universe size: 14844 
    Annotation package: org.Hs.eg

result <- summary(hgOver)
head(result,20)
       GOBPID Pvalue OddsRatio  ExpCount Count Size
1  GO:0000902      0       Inf 21.060361   319  980
2  GO:0000904      0       Inf 15.215036   319  708
3  GO:0006928      0       Inf 30.945837   319 1440
4  GO:0006935      0       Inf 13.173471   319  613
5  GO:0007409      0       Inf 10.938494   319  509
6  GO:0007411      0       Inf  7.757949   319  361
7  GO:0009605      0       Inf 29.226624   319 1360
8  GO:0022008      0       Inf 25.723727   319 1197
9  GO:0030030      0       Inf 21.060361   319  980
10 GO:0030182      0       Inf 22.414241   319 1043
11 GO:0031175      0       Inf 15.601859   319  726
12 GO:0032989      0       Inf 22.435732   319 1044
13 GO:0032990      0       Inf 15.386958   319  716
14 GO:0040011      0       Inf 28.001684   319 1303
15 GO:0042330      0       Inf 13.173471   319  613
16 GO:0048468      0       Inf 33.847009   319 1575
17 GO:0048666      0       Inf 18.030248   319  839
18 GO:0048667      0       Inf 12.163433   319  566
19 GO:0048699      0       Inf 24.240905   319 1128
20 GO:0048812      0       Inf 12.378335   319  576
                                                    Term
1                                     cell morphogenesis
2         cell morphogenesis involved in differentiation
3                            cellular component movement
4                                             chemotaxis
5                                           axonogenesis
6                                          axon guidance
7                          response to external stimulus
8                                           neurogenesis
9                           cell projection organization
10                                neuron differentiation
11                         neuron projection development
12                      cellular component morphogenesis
13                               cell part morphogenesis
14                                            locomotion
15                                                 taxis
16                                      cell development
17                                    neuron development
18 cell morphogenesis involved in neuron differentiation
19                                 generation of neurons
20                       neuron projection morphogenesis

So as expected we get a bunch of GO terms associated to neurons and of our original GO term, axon guidance.

What if my gene list are not Entrez gene ids?

Use biomaRt. For this example I will show how you can go from Ensembl gene IDs to Entrez gene IDs:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
 
library("biomaRt")
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")

my_chr <- c(1:22, 'M', 'X', 'Y')
my_ensembl_gene <- getBM(attributes='ensembl_gene_id',
                    filters = 'chromosome_name',
                    values = my_chr,
                    mart = ensembl)

#how many entries
length(my_ensembl_gene)
[1] 52322

five_ensembl <- my_ensembl_gene[1:5,]

five_ensembl_to_entrez <- getBM(attributes=c('ensembl_gene_id', 'entrezgene'), filters = 'ensembl_gene_id', values = five_ensembl, mart = ensembl)
five_ensembl_to_entrez
  ensembl_gene_id entrezgene
1 ENSG00000036549      26009
2 ENSG00000037637      54455
3 ENSG00000221126         NA
4 ENSG00000225011         NA
5 ENSG00000228176         NA

#out of interest how many entrez gene ids?
my_entrez_gene <- getBM(attributes='entrezgene',
                    filters = 'chromosome_name',
                    values = my_chr,
                    mart = ensembl)

length(my_entrez_gene[,1])
[1] 24231

#get some more info on the entrez_gene
my_attribute <- c('entrezgene',
                  'hgnc_symbol',
                  'chromosome_name',
                  'start_position',
                  'end_position',
                  'strand')

my_entrez_gene_info  <- getBM(attributes=my_attribute,
                        filters = c('entrezgene', 'chromosome_name'),
                        values = list(entrezgene=my_entrez_gene$entrezgene, chromosome_name=my_chr),
                        mart = ensembl)

head(my_entrez_gene_info)
  entrezgene hgnc_symbol chromosome_name start_position end_position strand
1     266919   LINC00307              21       31581469     31584101     -1
2       1652         DDT              22       24313554     24322660     -1
3  100422891     MIR4327              21       31747612     31747696     -1
4     115761       ARL11              13       50202435     50208008      1
5      64863      METTL4              18        2537524      2571508     -1
6       3704        ITPA              20        3189514      3204516      1
dim(my_entrez_gene_info)
[1] 26232     6

#entrez gene id on different chromosome locations
my_entrez_gene_info[my_entrez_gene_info$entrezgene=='728369',]
    entrezgene hgnc_symbol chromosome_name start_position end_position strand
306     728369    USP17L24               4        9326891      9328483      1
316     728369    USP17L25               4        9331637      9333229      1
327     728369    USP17L26               4        9336384      9337976      1
337     728369    USP17L27               4        9345874      9347466      1
345     728369    USP17L28               4        9350619      9352211      1
353     728369    USP17L29               4        9355364      9356956      1
363     728369    USP17L30               4        9364855      9366447      1

To learn more about the missing Entrez ID values from the Ensembl conversion see this useful post on BioStars and my post on converting gene identifiers.

Further reading

org.Hs.eg.db reference manual

GOstats vignette.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
10 comments Add yours
    1. When I created the post, the aim was to test topGO as well. As you've spotted, I didn't actually use it.

      Perhaps for the next update.

  1. Hi!

    Why do you use the background (universe) as the setdiff between your axon genes and all genes? I thought the right way to do it was to use all the genes annotated.

    Are you just trying to amp up your p-values, a.k.a, significance of results obtained (by using setdiff) for demo sake?

    1. Hi Shrutii,

      yes you're right; the hypergeometric-based tests requires all genes as the gene universe.

      At one stage, I was comparing the results when using different gene universes. I'll modify the post to state this clearly.

      Thanks for the comment!

      Cheers,

      Dave

  2. How to compare and categorize enrichment CO terms from multiple files?

    I need to find a way to substitute more specific child GO terms with less specific parent GO terms since my remaining vision is not good enough for comparing them by visually looking at the enriched GO terms. I have copy-pasted my latest version of the R code I am currently using at the end of this email. Could anyone please assist me in accomplishing this or connect me to somebody, who might? I only have 4 weeks left to finish my thesis. If I cannot defend by the end of March my funding will be discontinued.

    Could you please respond to me directly via email (Thomas.F.Hahn2@gmail.com) or Skype (tfh002) or cell phone (318 243 3940) because my text to speech software, on which I am depending to access electronic information due to my very limited remaining vision, cannot read out all portions of this website aloud to me? Fortunately, it is much more compatible with the G-mail and Skype interface thus allowing me to communicate much more efficiently and reliably when using those.

    I am very grateful for any kind of assistance in earning a PhD in bioinformatics despite being almost blind.

    With very warm regards,

    Thomas Hahn
    Cell phone: 318 243 3940
    Skype ID: tfh002
    Email: Thomas.F.Hahn2@gmail.com
    #################################################################################################
    ###
    ### This script combines the two subscripts that preparation and plotting of the log-fold
    ### yeast expression changes under CR and YEPD
    ###
    ### Additional (minor) changes:
    ### -
    ###
    ### Feb 20, 2015
    ###
    #################################################################################################
    message("Clean up workspace")
    rm(list=ls(all=T))

    #------------------------------------------------------------------------------------------------
    ### Generate scatter plot input file from:
    ### 1. Annotation file from Expression Lab (Affy)
    ### 2. Expression file from Expression Lab (Affy)
    ### Creates:
    ### 1. *.for.scatterplot.txt file in the folder in the current working directory
    message("Create Scatterplot input files...")
    setwd("D:/Analysis")

    ## INPUT FILES (relative or full path)
    file="Our_data_from_expression_console.txt" # file with data from expression console
    annotation = read.delim("Our_data_annotation.txt", header = F) # annotation for replicates
    ## OUTPUT directory
    my.directory = dir.scatter="2015_02_26_Erg6"
    chosen.folder = "chosen_genes"

    #######################################################
    replicates = tapply(annotation[,1], annotation[,2], list)

    normdata = read.delim(file, check.names = F)
    ####Inserted by Patrick on Thursday, February 26th to set the minimum threshold to 7.6
    adaptLowerBound <- TRUE # set to FALSE if no adaption so take place !!!
    if (adaptLowerBound) {
    tmp.colIndices <- 2:7 # column indices of expression values !!!
    lowerBound <- 7.61 # hard-coded lower boundary of expression values !!!
    normdata.orig <- normdata
    normdata.sub <- normdata[,tmp.colIndices]
    normdata.sub[which(normdata.sub <= lowerBound, arr.ind=T)] <- lowerBound
    normdata[,tmp.colIndices] 1))

    # remove the ones with multiple probes and leave only the conditions and ratios
    final.normdata=normdata[!(as.character(normdata$Systematic) %in% multiple), c("Probesets","Systematic","Common",names(replicates), write.ratios)]

    pombe = grep("^SP",as.character(final.normdata$Systematic))
    affy = grep("^AF", as.character(final.normdata$Systematic))
    remove = c(pombe, affy)
    if (length(remove)>0){
    final.normdata = final.normdata[-remove,]
    }

    normfile=gsub(".txt", ".for.scatterplot.txt", file)
    dir.create(dir.scatter, showWarnings=F)
    fn.scatterInput <- file.path(dir.scatter, normfile)
    write.table(final.normdata, fn.scatterInput, sep="\t", row.names=FALSE, col.names=TRUE, quote=F)
    message(" Done.")

    #------------------------------------------------------------------------------------------------
    ### Generate scatter plot from:
    ### 1. *.for.scatterplot.txt file in the "Scatterplot_out" folder in the current working directory (not required to be specified)
    ### 2. *.txt list of genes
    ### Creates:
    ### 1. *.for.scatterplot.txt file in the "Scatterplot_out" folder in the current working directory
    message("Create Scatterplot...")

    #setwd("D:/Tang/Patrick/Friday_Feb20_2015")

    ## INPUT FILES
    # read in the file with chosen genes from a txt file
    # give the names of the chosen files, the colors and the labels
    chosen.input.files = c(
    "Ergosterol SGD 150 genes.txt",
    "Dietary Restriction 72 genes.txt",
    "Autophagy SGD 61 genes.txt",
    "Sphingolipids 12 genes.txt",
    "ER lipid enzymes 9 genes.txt",
    "Flipases 7 genes.txt",
    "mito lipid enzymes 2 genes.txt",
    "Vacuolar Lipid Enzymes 2 genes.txt",
    "Phospholipid binding 16 genes.txt",
    "Glucose transport 9 genes.txt"
    )

    # what is the gene universe
    #universe=read.delim(file=fn.scatterInput, header=T)
    #universe$Systematic=gsub("[.]S1", "", universe$Systematic)# replaces .S1 with empty strings

    universe=read.delim(file=fn.scatterInput, header=T)
    universe$Systematic=gsub("[.]S1", "", universe$Systematic)# replaces .S1 with empty strings

    # select the columns
    yaxis="Erg6c.WTc"
    #yaxis="Our.WTc.Our.WTy"
    #yaxis="WTy"
    #yaxis = colnames(universe)[4]#sets the Y axis to the name of the last coumn
    #yaxis = colnames(universe)[ncol(universe)]#sets the Y axis to the name of the last coumn
    #yaxis=ncol(universe)
    #yaxis="Ag15c.Atg15yR1"
    yratio=1.5
    #xaxis=colnames(universe)[5]
    xaxis="Erg6y.WTy"
    #xaxis="WT.CR.2.WT.2"
    #xaxis="X2.G.12.h"

    xratio=1.5
    logarithm = TRUE # for the ratio the logarithm should be set to TRUE
    enrichment = TRUE
    my.legend = TRUE
    Brown = FALSE
    no.black = FALSE # if set to TRUE it does not do enrichment for the chosen genes. True means it will do everything, except for the chosen genes
    black.only = FALSE # put it to FALSE if you want all genes
    #Black = TRUE# it does enrichment only for the black genes (put it to FALSE to do it for everybody). False means that it enriches the over and under expressed but not the chosen

    if (enrichment == TRUE){
    library(GOstats)
    library(org.Sc.sgd.db)
    }

    exclusion.quadrant = FALSE
    xaxis.limit = 7.61
    yaxis.limit = 7.61
    exclusion.color = "violet"
    ################################
    number.of.lines = 5

    chosen.colors = c("cornflowerblue", "gray53", "seagreen4", "blue", "deeppink3", "saddlebrown", "red", "black", "blueviolet", "darkblue")
    chosen.labels = c("cornflowerblue", "gray53", "seagreen4", "blue", "deeppink3", "saddlebrown", "red", "black", "blueviolet", "darkblue")
    my.directory = "2015_02_26_Erg6"
    #####################################################################
    ### choose colors for the points
    above.line.color = "yellow"
    below.line.color = "cyan"
    handpicked.color = "saddlebrown"
    middle.genes.color = "moccasin"

    above.line.label = "red"
    below.line.label = "blue"
    handpicked.label = "burlywood4"
    middle.genes.label = "navajowhite4"

    ### choose colors for the lines
    upper.diagonal.color = "yellow"
    lower.diagonal.color = "yellow"
    left.vertical.line = "yellow"
    right.vertical.line = "yellow"
    upper.horizontal.line = "yellow"
    lower.horizontal.line = "yellow"

    # define the shapes
    shape.above.line = 17
    shape.below.line = 15

    legend.size = 0.4# font size
    point.size = 0.4# size of points in scatter plot
    show.above.names = "TRUE" #you can change it to FALSE
    show.below.names = "TRUE"
    show.middle.names = "FALSE"
    show.chosen = "TRUE"

    # define parameters for gene ontology enrichment
    hgCutoff <- 0.05
    FDR.thresh=0.1

    text.cex=0.25 # This number tells the font size of the selected genes.
    #######################################################################
    #######################################################################
    names=as.character(universe$Common)
    nocommon=which(names=="---")
    names[nocommon]=as.character(universe$Systematic[nocommon])
    # define the chosen genes and also the colors for the up and down regulated
    # read in the file with chosen genes from a txt file
    for ( i in 1:length(chosen.input.files)){
    temp = vector()
    c.name = paste("chosen", i, sep = ".")
    assign(c.name, temp)
    assign(c.name, toupper(as.character(unlist(read.delim(file=paste(chosen.folder, "/",chosen.input.files[i], sep = ""), header=F)))))

    temp2 = get(c.name)
    assign(c.name, temp2[which((temp2 %in% universe$Systematic | temp2 %in% universe$Common) )])

    # assign colors for points and labels for the chosen genes
    col.name = paste("chosen", i, "color", sep = ".")
    lab.name = paste("chosen", i, "label", sep = ".")

    assign(col.name, chosen.colors[i])
    assign(lab.name, chosen.labels[i])

    assign( paste("chosen", i, "up.color", sep = "."), chosen.colors[i])
    assign( paste("chosen", i, "down.color", sep = "."), chosen.colors[i])

    assign(paste("chosen", i, "up.label", sep = "."), chosen.labels[i])
    assign(paste("chosen", i, "down.label", sep = "."), chosen.labels[i])
    }

    dir.create(my.directory, showWarnings=F)
    directory=paste(my.directory, paste(yaxis, "vs", xaxis, "FC", yratio, "FC", xratio, "cutoff", hgCutoff, sep="_" ), sep="/")
    dir.create(directory, showWarnings=F)

    if (logarithm == TRUE){

    x.all = log2(universe[,xaxis])
    y.all = log2(universe[,yaxis])
    labs.pattern = "log2"
    intercept = 2 * log2(yratio)
    b1 = "("
    b2 = ")"

    } else {

    x.all = universe[,xaxis]
    y.all = universe[,yaxis]
    labs.pattern = b1 = b2 = ""
    intercept = log2(yratio)
    #intercept = (yratio)
    }

    keep1= x.all y.all + intercept

    ### plot the whole thing at once

    shape.all = rep(19, length(x.all))
    shape.all[keep1] = 17
    shape.all[keep2] = 15

    color.all = rep(middle.genes.color, length(x.all))
    color.all[keep1] = above.line.color
    color.all[keep2] = below.line.color

    # plot all the points
    if(logarithm==TRUE){

    plot(x.all, y.all, xlab=paste(labs.pattern,b1 ,xaxis,b2, sep=""), ylab=paste(labs.pattern,b1,yaxis,b2, sep=""),pch=shape.all, col=color.all, cex = point.size)#, xlim=c(0,log2(max(universe[,xaxis]))), ylim=c(0,log2(max(universe[,yaxis]))))
    }else{

    plot(x.all, y.all, xlab=paste(labs.pattern,b1 ,xaxis,b2, sep=""), ylab=paste(labs.pattern,b1,yaxis,b2, sep=""),pch=shape.all, col=color.all, cex = point.size)#, xlim=c(0,max(universe[,xaxis])), ylim=c(0,max(universe[,yaxis])))
    }

    xr=range(x.all)
    #xr[1]=0
    axis(1, at=round(xr[1]):round(xr[2]), labels=round(xr[1]):round(xr[2]))
    yr=range(y.all)
    #yr[1]=0
    axis(2, at=round(yr[1]):round(yr[2]), labels=round(yr[1]):round(yr[2]))
    text(round(xr[1])+2, round(yr[2])-1,paste("r = ",round(cor.test(x.all, y.all)$estimate, digit=2), sep=""))

    abline(0,1, col="tan")
    abline(intercept, 1 , col=upper.diagonal.color, lwd=2)
    abline(-intercept,1 , col=lower.diagonal.color, lwd=2)

    if (logarithm == TRUE){

    abline(h=log2(yratio), col=upper.horizontal.line, lwd=2)
    abline(h=log2(1/yratio), col=lower.horizontal.line, lwd=2)
    abline(v=log2(xratio), col=right.vertical.line, lwd=2)
    abline(v=log2(1/xratio), col=left.vertical.line, lwd=2)

    }

    points(x.all, y.all,pch=shape.all, col=color.all, cex = point.size)

    if (show.above.names == "TRUE"){
    text(x = x.all[ keep1 ], y = y.all[ keep1 ], label = names[ keep1 ], col = above.line.label, cex = text.cex, pos = 4, offset = 0.25)
    }
    if (show.below.names == "TRUE"){
    text(x = x.all[ keep2 ], y = y.all[ keep2 ], label = names[ keep2 ], col = below.line.label, cex = text.cex, pos = 4, offset = 0.25)
    }
    if (show.middle.names == "TRUE"){
    text(x = x.all[ !(keep2 | keep1) ], y = y.all[!(keep2 | keep1) ], label = names[ !(keep2 | keep1)], col = middle.genes.label, cex = text.cex, pos = 4, offset = 0.25)
    }

    if (exclusion.quadrant == TRUE){
    rect(0, yaxis.limit, xaxis.limit, yaxis.limit, border = exclusion.color)
    rect(xaxis.limit, 0 , xaxis.limit, yaxis.limit, border = exclusion.color)
    }

    if (logarithm == TRUE){
    title(main=paste(paste(yaxis, "vs.",xaxis),paste("lines: log2(",round(1/yratio, digit=2),") and log2(", yratio,")", sep=""),paste("lines: log2(",round(1/xratio, digit=2),") and log2(", xratio,")", sep=""), sep="\n"))
    } else {
    title(main=paste(yaxis, "vs.",xaxis))
    }
    if (Brown==TRUE){
    index=identify(x=x.all, y=y.all, label="", cex=text.cex, col=handpicked.color, pos = TRUE)
    points(x.all[index$ind], y.all[index$ind], col=handpicked.color, cex = point.size, pch = shape.all[index$ind])
    text(x.all[index$ind], y.all[index$ind], col=handpicked.label, label = names[index$ind], cex = text.cex, pos = 4, offset = 0.25)

    }

    # get the positions of the chosen genes and depict them on the plot (do this also for genes above and below the line)
    ###################################
    message(" Set chosen colors")
    for (i in 1:length(chosen.input.files)){

    # get the labels and colors
    col1 = get(paste("chosen", i, "color", sep="."))
    lab1 = get(paste("chosen", i, "label", sep="."))

    # get the overlap
    overlap=which(universe$Common %in% get(paste("chosen", i, sep=".")) | universe$Systematic %in% get(paste("chosen", i, sep=".")))
    if (length(get(paste("chosen", i, sep=".")))){
    points(x.all[overlap], y.all[overlap], col=col1, pch=shape.all[overlap], cex = point.size)
    if(show.chosen == "TRUE"){
    text(x.all[overlap], y.all[overlap], labels= names[overlap], pos=4, col=lab1, cex=text.cex, offset = 0.25)
    }
    ratio.chosen = log2(2^y.all[overlap]/2^x.all[overlap])
    names(ratio.chosen) = names[overlap]
    assign(paste("chosen.up", i, sep="."),names[overlap[ratio.chosen>=intercept]])
    assign(paste("chosen.down", i, sep="."),names[overlap[ratio.chosen0){
    write.table(as.data.frame(dat1), file=outfile, sep="\t", quote=F, row.names=F, col.names=F)
    write.table(as.data.frame(universe[universe$Common %in% as.character(unique(dat1)) | universe$Systematic %in% as.character(unique(dat1)),"Systematic"]), file=outfile2, sep="\t", quote=F, row.names=F, col.names=F)
    }
    }
    }

    if(my.legend == TRUE){
    legend("bottomright", bty = "n", fill = chosen.colors, legend = chosen.input.files, cex = legend.size, pt.cex = legend.size)
    }

    # write out the names of the different points

    if(Brown==TRUE){
    index.names=universe[index$ind,c("Common", "Systematic")]
    index.names$Common=as.character(index.names$Common)
    index.names$Common[index.names$Common=="---"]=as.character(index.names$Systematic[index.names$Common=="---"])
    if (nrow(index.names)>0){
    write.table(as.data.frame(index.names[,1]), file=paste(directory, "/", yaxis, "_vs_", xaxis,"_handpicked_genes_", handpicked.color, ".tsv", sep=""), sep="\t", quote=F, row.names=F, col.names=F)
    # write out systematic names
    write.table(as.data.frame(index.names[,2]), file=paste(directory, "/", yaxis, "_vs_", xaxis,"_handpicked_genes_", handpicked.color, "systematic.tsv", sep=""), sep="\t", quote=F, row.names=F, col.names=F)
    }
    }

    keep1.names=universe[keep1,c("Common", "Systematic")]
    keep1.names$Common=as.character(keep1.names$Common)
    keep1.names$Common[keep1.names$Common=="---"]=as.character(keep1.names$Systematic[keep1.names$Common=="---"])
    if (nrow(keep1.names)>0){
    write.table(as.data.frame(keep1.names[,1]), file=paste(directory, "/", yaxis, "_vs_", xaxis,"_", above.line.color, "_dots.tsv", sep=""), sep="\t", quote=F, row.names=F, col.names=F)
    write.table(as.data.frame(keep1.names[,2]), file=paste(directory, "/", yaxis, "_vs_", xaxis,"_", above.line.color, "_dots_systematic.tsv", sep=""), sep="\t", quote=F, row.names=F, col.names=F)
    }
    keep2.names=universe[keep2,c("Common", "Systematic")]
    keep2.names$Common=as.character(keep2.names$Common)
    keep2.names$Common[keep2.names$Common=="---"]=as.character(keep2.names$Systematic[keep2.names$Common=="---"])
    if (nrow(keep2.names)>0){
    write.table(as.data.frame(keep2.names[,1]), file=paste(directory, "/", yaxis, "_vs_", xaxis,"_", below.line.color, "_dots.tsv", sep=""), sep="\t", quote=F, row.names=F, col.names=F)
    write.table(as.data.frame(keep2.names[,2]), file=paste(directory, "/", yaxis, "_vs_", xaxis,"_", below.line.color, "_dots_systematic.tsv", sep=""), sep="\t", quote=F, row.names=F, col.names=F)
    }

    keep3.names=rbind(keep1.names, keep2.names)
    if (nrow(keep3.names)>0){
    write.table(as.data.frame(keep3.names[,1]), file=paste(directory, "/", yaxis, "_vs_", xaxis,"_", above.line.color, "_and_", below.line.color, "_dots.tsv", sep=""), sep="\t", quote=F, row.names=F, col.names=F)
    write.table(as.data.frame(keep3.names[,2]), file=paste(directory, "/", yaxis, "_vs_", xaxis,"_", above.line.color, "_and_", below.line.color, "_dots_systematic.tsv", sep=""), sep="\t", quote=F, row.names=F, col.names=F)
    }

    #display the numbers of upregulated and downregulated genes
    total <- dim(universe)[1]
    nup <- dim(keep1.names)[1]
    ndown <- dim(keep2.names)[1]
    text(x=-5, y=5, labels = paste("up",nup, "genes\n",
    round(nup/total, digits = 4)*100, "%", sep = " "), cex = 0.7)
    text(x=3, y=-4, labels = paste("down",ndown, "genes\n",
    round(ndown/total, digits = 4)*100, "%", sep = " "), cex = 0.7)
    text(x=-4, y=-4, labels = paste("diff",ndown+nup, "genes\n",
    round((ndown+nup)/total, digits = 4)*100, "%", sep = " "), cex = 0.7)

    my.plot=recordPlot()
    pdf(file=paste(directory, "/", yaxis, "_vs_", xaxis,"_plot.pdf", sep=""), height = 6.5, width = 6.5)
    replayPlot(my.plot)
    dev.off()

    if (enrichment == TRUE){
    message(" Calculate enrichment...")
    # do the enrichment analysis for the three groups of genes
    condition1=yaxis
    condition2=xaxis
    fold.change1=yratio
    fold.change2=xratio

    my.chosen.objects = grep("syst", grep("(color|label|files)", grep("^chosen", objects(), value = T), value = T, invert = T), invert = T, value = T)

    for (obj in my.chosen.objects){
    assign(paste(obj, "syst", sep = "."), as.character(universe[universe$Systematic %in% get(obj) | universe$Common %in% get(obj),c("Systematic")]))
    }

    reduced.data = universe[,c("Systematic", xaxis, yaxis)]
    rownames(reduced.data) = universe$Systematic
    above.line.syst=as.character(keep1.names$Systematic)
    if (exclusion.quadrant == TRUE){
    above.filter.1 = reduced.data[above.line.syst, c(xaxis)]< xaxis.limit
    above.filter.2 = reduced.data[above.line.syst, yaxis]0)>0){
    above.line.syst = above.line.syst[above.filter == 0]
    }
    }

    below.line.syst=as.character(keep2.names$Systematic)
    if (exclusion.quadrant == TRUE){
    below.filter.1 = reduced.data[below.line.syst, c(xaxis)]< xaxis.limit
    below.filter.2 = reduced.data[below.line.syst, yaxis]0)>0){
    below.line.syst = below.line.syst[below.filter == 0]
    }
    }

    above.line.and.below.line.syst=c(above.line.syst, below.line.syst)
    if (Brown==TRUE){
    handpicked.syst=as.character(index.names$Systematic)
    }
    #ontology="BP" #change ontology here (B.P, M.F, C.C)
    ####################################################
    ####################################################
    ####################################################
    if (Brown == TRUE){
    all.gene.groups = c("above.line.syst", "below.line.syst", "above.line.and.below.line.syst", "handpicked.syst")
    } else {
    all.gene.groups = c("above.line.syst", "below.line.syst", "above.line.and.below.line.syst")
    }

    # if (Black.only == TRUE){
    # all.gene.groups = c("chosen.1.syst", "chosen.2.syst")
    # }
    if (no.black == TRUE){
    all.gene.groups = all.gene.groups
    } else if (black.only == TRUE) {
    all.gene.groups = grep("chosen.*syst", objects(), value = T)
    } else {
    all.gene.groups = c(all.gene.groups,grep("chosen.*syst", objects(), value = T))
    }
    # what are the GO enrichments
    message(" What are the GO enrichments")
    for (g in all.gene.groups){
    #for (ontology in c("MF", "BP", "CC")){#Delete whatever component I don't want
    for (ontology in c("MF", "BP", "CC")){
    # what are the selected genes?
    universeIds=as.character(universe$Systematic)
    selectedGenes=get(g)

    if(length(selectedGenes) == 0){ next }

    params <- new("GOHyperGParams",
    geneIds=selectedGenes,
    universeGeneIds=universeIds,
    annotation="org.Sc.sgd.db",
    ontology=ontology,
    pvalueCutoff=hgCutoff,
    conditional=TRUE,
    testDirection="over")

    #you have to change ontology="CC"
    #to ontology="M.F" or "C.C" or "B.P"

    # make a hypergeometric test for overrepresentation
    hgOver=hyperGTest(params)

    ### retrieve genes corresponding to each GO ID
    # (GO ID, GO description, p-value)
    summaryHgOver=summary(hgOver)
    out <- summaryHgOver[,c(1,2,7)]
    # retrieve input genes associated with each GO identifier
    # use the org.Sc data mapping to get GO terms for each ID
    column=paste("GO", ontology, "ID", sep="")
    goMaps <- lapply(out[,column], function(x) unlist(mget(x, org.Sc.sgdGO2ALLORFS)))
    # it assigns each gene, which surrived the filtering, to all GO terms, which are enriched
    goSelected <- lapply(goMaps, function(x) selectedGenes[selectedGenes %in% x])

    # append a column to the table
    summaryHgOver[,ncol(summaryHgOver)+1]=do.call(rbind, lapply(goSelected, function(x) paste(unique(x), collapse=";")))

    # map yeast ORFs to common names

    common = lapply(goSelected, function(x){ y = sapply(x, function(z){ zz = as.character(universe[(universe$Systematic %in% z), "Common"] )})})
    commonNames=lapply(common, function(x){paste(unique(unlist(strsplit(unlist(x),", "))), collapse=";")})
    summaryHgOver[,ncol(summaryHgOver)+1]=do.call(rbind, commonNames)
    colnames(summaryHgOver)[(ncol(summaryHgOver)-1):ncol(summaryHgOver)] = c("Systematic", "Common")

    # write out a table
    gg=gsub("\\.syst", "", g)
    gg1 = gsub("\\.$", "", gsub("^\\.", "", unlist(strsplit(gg, "and"))))
    if (g == "handpicked.syst") {gg1 = "handpicked"}
    gene.colors = paste(sapply(paste(gg1, "color", sep="."), get), collapse = "_and_")

    ###########################
    if (grepl("chosen", gg)){
    ch.object = unlist(strsplit(gg1,"\\."))
    num = as.numeric(ch.object[length(ch.object)])
    ch.label = as.character(ch.object[length(ch.object)-1])
    ch.col = get(paste("chosen", num, "color", sep="."))
    ch.file = chosen.input.files[num]
    if (ch.label == "chosen"){
    ch.label = ""
    }
    gene.colors = paste(toupper(ch.label), "chosen_from", gsub("\\.(txt|tsv)","", ch.file), sep="_")

    }
    ########################################
    if (g == "handpicked.syst") { gene.colors = paste ("handpicked", gene.colors,sep ="_" )}
    filename=paste(directory, "/", condition1, "_FC", fold.change1,"_vs_", condition2, "_FC", fold.change2,"_", ontology, "_",hgCutoff,"_", gene.colors, "_genes.tsv", sep="")
    #print(filename)

    ## EDIT (Feb 20, 2015): Adding FDR correction
    ## patrick@jimmy.harvard.edu
    summaryHgOver <- cbind(summaryHgOver, FDR=p.adjust(summaryHgOver$Pvalue))
    dir.create(file.path(directory, "Full"), showWarnings=F, recursive=T)
    dir.create(file.path(directory, "Filtered"), showWarnings=F, recursive=T)
    dir.create(file.path(directory, "Full_with_less_columns"), showWarnings=F, recursive=T)
    dir.create(file.path(directory, "Filtered_with_less_columns"), showWarnings=F, recursive=T)
    filename.full=file.path(dirname(filename), "Full", basename(filename))
    filename.filtered=file.path(dirname(filename), "Filtered", gsub("\\.tsv$", ".filtered.tsv", basename(filename)))
    filename.full.less=file.path(dirname(filename), "Full_with_less_columns", basename(filename))
    filename.filtered.less=file.path(dirname(filename), "Filtered_with_less_columns", gsub("\\.tsv$", ".filtered.tsv", basename(filename)))
    write.table(summaryHgOver, file=filename.full, sep="\t", quote=F, row.names=F, col.names=T)
    write.table(summaryHgOver[,-c(3,8,9)], file=filename.full.less, sep="\t", quote=F, row.names=F, col.names=T)
    summaryHgOver.filtered <- summaryHgOver[summaryHgOver$FDR < FDR.thresh, , drop=F]
    write.table(summaryHgOver.filtered, file=filename.filtered, sep="\t", quote=F, row.names=F, col.names=T)
    write.table(summaryHgOver.filtered[,-c(3,8,9)], file=filename.filtered.less, sep="\t", quote=F, row.names=F, col.names=T)
    }
    }

    # concatenate the enrichment files
    message("Concatenate enrichment files...")

    search.directory = directory
    ####===================================================
    files = list.files(search.directory, pattern = "(MF|BP|CC).*genes.tsv", full.names=T)
    #files = list.files(search.directory, pattern = "BP.*genes.tsv", full.names=T)
    #############===================================================

    #####################################

    file.list = list()
    for (f in files){
    file.list[[f]] = read.delim(f, header = F)
    }

    names(file.list) = gsub("^.*\\/","", names(file.list))

    new.list = list()
    count = 1
    for (i in 1:length(file.list)){
    d = t(data.frame(c(names(file.list)[i], rep("", 4))))
    colnames(d) = paste("V", 1:5, sep="")
    if(nrow(file.list[[i]])=number.of.lines){
    new2 = rbind(new[1:(number.of.lines+2),], empty)
    } else {
    new2 = rbind(new, empty)
    }
    new.list[[i]] = new2
    }

    write.table(do.call(rbind, new.list), file=paste(search.directory, "/GO.enrichments.all.tables.", directory, ".txt", sep=""), row.names=F, col.names=F, sep="\t", quote = F)
    }
    message(" Done.")

    ## Additional: February 20, 2015, patrick@jimmy.harvard.edu
    ##

    1. Hi Thomas,

      I'm not an expert but I'll have a look through your code on the weekend and see if I can answer your question.

      Cheers,

      Dave

      1. Dear Dave,

        How are you? I thank you so very much for kindly reaching out to me by offering your assistance in helping me earning a PhD in bioinformatics despite being almost blind. I am glad you responded to my post. Since my poor vision makes it hard for me to read and write I would prefer discussing details with you by voice. Do you have Skype? If so, what is your Skype ID? Mine is tfh002. Could you please send me a Skype invite? When would be a good time for us to talk? Where are you located?

        I kind of need to have my thesis completed within the next 2 weeks. I already have R scripts. I would need help in having the software summarizing the results for me. It is too much for me to compare all relevant GO (Gene Ontology) terms visually. I also need help in comparing my findings with other publications. There are so many ways to analyze data. I am not sure which approach to use. My R script is 600 lines long and produces almost 200 enrichment files with every single run. I am feeling completely overwhelmed with data.

        Could you please respond to me directly via email Skype or phone because my text to speech software has difficulties reading out this website aloud to me. I can communicate much better in the much more familiar interface of G-mail and Skype.

        I am very much looking forward to the opportunity to talk to you by voice soon.

        I thank you very much in advance for your assistance, time, help, efforts and support.

        With very warm regards,

        Thomas Hahn
        Cell phone: 318 243 3940
        Skype ID: tfh002
        Email: Thomas.F.Hahn2@gmail.com

  3. Hi Dave,
    Excellent post.
    I don't see where the "x" object came from when you convert to a list.

    Thank you.

Leave a Reply

Your email address will not be published. Required fields are marked *