Gene Ontology enrichment analysis

Updated: 2019 March 24th

The Gene Ontology Enrichment Analysis is a popular type of analysis that is carried out after a differential gene expression analysis has been carried out. There are many tools available for performing a gene ontology enrichment analysis. Online tools include DAVID, PANTHER and GOrilla. Bioconductor pacakges include GOstats, topGO and goseq.

In this post, I'll go through the GOstats package. Firstly download and install some packages that will be used in this post.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GOstats")

# this should be installed when you install GOstats
# install GO.db too, which is
# A set of annotation maps describing the entire Gene Ontology
BiocManager::install("GO.db")

# Genome wide annotation for Human, primarily based on mapping using Entrez Gene identifiers.
BiocManager::install("org.Hs.eg.db")

Now let's create a toy example where my test set of genes are those that are associated with the GO term GO:0007411 (axon guidance). We can use the org.Hs.eg.db package to fetch all genes associated with axon guidance.

library(org.Hs.eg.db)

# for reference's sake
# how to get GO terms from an Entrez gene id following the manual
# http://bioconductor.org/packages/release/data/annotation/manuals/org.Hs.eg.db/man/org.Hs.eg.db.pdf

# org.Hs.egGO is an R object that provides
# mappings between Entrez gene identifers and the GO
# identifers that they are directly associated with
entrez_object <- org.Hs.egGO

class(entrez_object)
[1] "Go3AnnDbBimap"
attr(,"package")
[1] "AnnotationDbi"

# Get the Entrez gene IDs that are mapped to a GO ID
mapped_genes <- mappedkeys(entrez_object)

# Entrez gene IDs
head(mapped_genes)
[1] "1"     "10"    "100"   "1000"  "10000" "10001"

# map GO IDs to Entrez IDs and convert to a list
entrez_to_go <- as.list(entrez_object[mapped_genes])

length(entrez_to_go)
[1] 20408

# http://www.ncbi.nlm.nih.gov/gene/?term=1
# entrez gene id 1 is A1BG
# below are the GO terms associated with the gene
sapply(entrez_to_go[[1]], function(x) x)
         GO:0002576   GO:0008150   GO:0043312   GO:0005576   GO:0005576   GO:0005576   GO:0005615  
GOID     "GO:0002576" "GO:0008150" "GO:0043312" "GO:0005576" "GO:0005576" "GO:0005576" "GO:0005615"
Evidence "TAS"        "ND"         "TAS"        "HDA"        "IDA"        "TAS"        "HDA"       
Ontology "BP"         "BP"         "BP"         "CC"         "CC"         "CC"         "CC"        
         GO:0031093   GO:0034774   GO:0062023   GO:0070062   GO:0072562   GO:1904813   GO:0003674  
GOID     "GO:0031093" "GO:0034774" "GO:0062023" "GO:0070062" "GO:0072562" "GO:1904813" "GO:0003674"
Evidence "TAS"        "TAS"        "HDA"        "HDA"        "HDA"        "TAS"        "ND"        
Ontology "CC"         "CC"         "CC"         "CC"         "CC"         "CC"         "MF"

# perform the reverse and map GO terms to Entrez gene ids
go_object <- as.list(org.Hs.egGO2EG)

length(go_object)
[1] 17983

# GO:0000002 -> mitochondrial genome maintenance
# entrez gene id 291 -> SLC25A4
# which if you look up has the GO term mitochondrial genome maintenance
go_object[1]
$`GO:0000002`
    TAS     IMP     ISS     IMP     IMP     NAS     IMP     IBA     IDA     IEA     IDA     IMP 
  "291"  "1890"  "4205"  "4358"  "4976"  "9361" "10000" "55154" "55186" "80119" "84275" "92667"

# now let's get all the entrez gene ids for
# GO:0007411 (axon guidance)

axon_gene <- go_object['GO:0007411']
length(unlist(axon_gene, use.names=FALSE))
[1] 201

# not all genes are unique
length(unique(unlist(axon_gene, use.names=FALSE)))
[1] 191

# use only unique genes
axon_gene <- unique(unlist(axon_gene, use.names=FALSE))
head(axon_gene)
[1] "323"  "474"  "627"  "655"  "682"  "1002"

Now let's perform the enrichment analysis; note that I use all genes with GO terms as the universe list. This list should be set to the genes that were actually assayed in your set of experiments. In addition, I am only testing for biological process terms that are over-represented and selecting terms that have a p-value of 0.001 or less.

library("GO.db")
library("GOstats")

# I will use all the genes with GO terms as the universe
universe <- mapped_genes

length(universe)
[1] 20408

length(axon_gene)
[1] 191

params <- new('GOHyperGParams',
              geneIds = axon_gene,
              universeGeneIds = universe,
              ontology = 'BP',
              pvalueCutoff = 0.001,
              conditional = FALSE,
              testDirection = 'over',
              annotation = "org.Hs.eg.db"
             )

hgOver <- hyperGTest(params)

hgOver
Gene to GO BP  test for over-representation 
4617 GO BP ids tested (1045 have p < 0.001)
Selected gene set size: 191 
    Gene universe size: 18493 
    Annotation package: org.Hs.eg

result <- summary(hgOver)
      GOBPID        Pvalue OddsRatio ExpCount Count Size
1 GO:0007409  0.000000e+00       Inf 4.637376   191  449
2 GO:0007411  0.000000e+00       Inf 2.675012   191  259
3 GO:0097485  0.000000e+00       Inf 2.685340   191  260
4 GO:0061564 1.572265e-319       Inf 5.091819   191  493
5 GO:0048667 1.383219e-306       Inf 5.773482   191  559
6 GO:0006935 2.278322e-296       Inf 6.393176   191  619
                                                   Term
1                                          axonogenesis
2                                         axon guidance
3                            neuron projection guidance
4                                      axon development
5 cell morphogenesis involved in neuron differentiation
6                                            chemotaxis

As expected, GO terms associated to axons are enriched. Note that the Size for GO:0007411 is 259 and not 191; we had selected all genes that were associated with GO:0007411 (axon guidance), so I was expecting the Count and the Size to be identical. Perhaps we didn't select all genes associated with axon guidance.

What if my gene list are not Entrez gene ids?

For this example, I will demonstrate how to convert Ensembl gene IDs to Entrez gene IDs using biomaRt.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("biomaRt", version = "3.8")
 
library("biomaRt")
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

my_chr <- c(1:22, 'M', 'X', 'Y')
my_ensembl_gene <- getBM(attributes='ensembl_gene_id',
                    filters = 'chromosome_name',
                    values = my_chr,
                    mart = ensembl)

dim(my_ensembl_gene)
[1] 60587     1

five_ensembl <- my_ensembl_gene[1:5,]

five_ensembl_to_entrez <- getBM(attributes=c('ensembl_gene_id', 'entrezgene_id'),
                                filters = 'ensembl_gene_id',
                                values = five_ensembl,
                                mart = ensembl)

# not all Ensembl gene IDs have Entrez IDs
five_ensembl_to_entrez
  ensembl_gene_id entrezgene_id
1 ENSG00000206779            NA
2 ENSG00000207256            NA
3 ENSG00000221643            NA
4 ENSG00000252151            NA
5 ENSG00000252682            NA

# out of interest how many entrez gene ids?
my_entrez_gene <- getBM(attributes='entrezgene_id',
                    filters = 'chromosome_name',
                    values = my_chr,
                    mart = ensembl)

dim(my_entrez_gene)
[1] 19399     1

# get some more info on the entrez_gene
my_attribute <- c('entrezgene_id',
                  'hgnc_symbol',
                  'chromosome_name',
                  'start_position',
                  'end_position',
                  'strand')

my_entrez_gene_info  <- getBM(attributes=my_attribute,
                        filters = c('entrezgene_id', 'chromosome_name'),
                        values = list(entrezgene=my_entrez_gene$entrezgene_id, chromosome_name=my_chr),
                        mart = ensembl)

head(my_entrez_gene_info)
  entrezgene_id hgnc_symbol chromosome_name start_position end_position strand
1         23612      PHLDA3               1      201464278    201469237     -1
2          4179        CD46               1      207752054    207795513      1
3          9909     DENND4B               1      153929501    153946718     -1
4         57459     GATAD2B               1      153789030    153923360     -1
5         54499       TMCO1               1      165724293    165827755     -1
6          7371        UCK2               1      165827614    165911618      1

dim(my_entrez_gene_info)
[1] 19495     6

To learn more about the missing Entrez ID values from the Ensembl conversion see this useful post on BioStars and my post on converting gene identifiers.

Further reading

org.Hs.eg.db reference manual

GOstats vignette.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
11 comments Add yours
    1. When I created the post, the aim was to test topGO as well. As you’ve spotted, I didn’t actually use it.

      Perhaps for the next update.

  1. Hi!

    Why do you use the background (universe) as the setdiff between your axon genes and all genes? I thought the right way to do it was to use all the genes annotated.

    Are you just trying to amp up your p-values, a.k.a, significance of results obtained (by using setdiff) for demo sake?

    1. Hi Shrutii,

      yes you’re right; the hypergeometric-based tests requires all genes as the gene universe.

      At one stage, I was comparing the results when using different gene universes. I’ll modify the post to state this clearly.

      Thanks for the comment!

      Cheers,

      Dave

  2. How to compare and categorize enrichment CO terms from multiple files?

    I need to find a way to substitute more specific child GO terms with less specific parent GO terms since my remaining vision is not good enough for comparing them by visually looking at the enriched GO terms. I have copy-pasted my latest version of the R code I am currently using at the end of this email. Could anyone please assist me in accomplishing this or connect me to somebody, who might? I only have 4 weeks left to finish my thesis. If I cannot defend by the end of March my funding will be discontinued.

    Could you please respond to me directly via email (Thomas.F.Hahn2@gmail.com) or Skype (tfh002) or cell phone (318 243 3940) because my text to speech software, on which I am depending to access electronic information due to my very limited remaining vision, cannot read out all portions of this website aloud to me? Fortunately, it is much more compatible with the G-mail and Skype interface thus allowing me to communicate much more efficiently and reliably when using those.

    I am very grateful for any kind of assistance in earning a PhD in bioinformatics despite being almost blind.

    With very warm regards,

    Thomas Hahn
    Cell phone: 318 243 3940
    Skype ID: tfh002
    Email: Thomas.F.Hahn2@gmail.com
    #################################################################################################
    ###
    ### This script combines the two subscripts that preparation and plotting of the log-fold
    ### yeast expression changes under CR and YEPD
    ###
    ### Additional (minor) changes:
    ### –
    ###
    ### Feb 20, 2015
    ###
    #################################################################################################
    message(“Clean up workspace”)
    rm(list=ls(all=T))

    #————————————————————————————————
    ### Generate scatter plot input file from:
    ### 1. Annotation file from Expression Lab (Affy)
    ### 2. Expression file from Expression Lab (Affy)
    ### Creates:
    ### 1. *.for.scatterplot.txt file in the folder in the current working directory
    message(“Create Scatterplot input files…”)
    setwd(“D:/Analysis”)

    ## INPUT FILES (relative or full path)
    file=”Our_data_from_expression_console.txt” # file with data from expression console
    annotation = read.delim(“Our_data_annotation.txt”, header = F) # annotation for replicates
    ## OUTPUT directory
    my.directory = dir.scatter=”2015_02_26_Erg6″
    chosen.folder = “chosen_genes”

    #######################################################
    replicates = tapply(annotation[,1], annotation[,2], list)

    normdata = read.delim(file, check.names = F)
    ####Inserted by Patrick on Thursday, February 26th to set the minimum threshold to 7.6
    adaptLowerBound <- TRUE # set to FALSE if no adaption so take place !!!
    if (adaptLowerBound) {
    tmp.colIndices <- 2:7 # column indices of expression values !!!
    lowerBound <- 7.61 # hard-coded lower boundary of expression values !!!
    normdata.orig <- normdata
    normdata.sub <- normdata[,tmp.colIndices]
    normdata.sub[which(normdata.sub <= lowerBound, arr.ind=T)] <- lowerBound
    normdata[,tmp.colIndices] 1))

    # remove the ones with multiple probes and leave only the conditions and ratios
    final.normdata=normdata[!(as.character(normdata$Systematic) %in% multiple), c(“Probesets”,”Systematic”,”Common”,names(replicates), write.ratios)]

    pombe = grep(“^SP”,as.character(final.normdata$Systematic))
    affy = grep(“^AF”, as.character(final.normdata$Systematic))
    remove = c(pombe, affy)
    if (length(remove)>0){
    final.normdata = final.normdata[-remove,]
    }

    normfile=gsub(“.txt”, “.for.scatterplot.txt”, file)
    dir.create(dir.scatter, showWarnings=F)
    fn.scatterInput <- file.path(dir.scatter, normfile)
    write.table(final.normdata, fn.scatterInput, sep="\t", row.names=FALSE, col.names=TRUE, quote=F)
    message(" Done.")

    #————————————————————————————————
    ### Generate scatter plot from:
    ### 1. *.for.scatterplot.txt file in the "Scatterplot_out" folder in the current working directory (not required to be specified)
    ### 2. *.txt list of genes
    ### Creates:
    ### 1. *.for.scatterplot.txt file in the "Scatterplot_out" folder in the current working directory
    message("Create Scatterplot…")

    #setwd("D:/Tang/Patrick/Friday_Feb20_2015")

    ## INPUT FILES
    # read in the file with chosen genes from a txt file
    # give the names of the chosen files, the colors and the labels
    chosen.input.files = c(
    "Ergosterol SGD 150 genes.txt",
    "Dietary Restriction 72 genes.txt",
    "Autophagy SGD 61 genes.txt",
    "Sphingolipids 12 genes.txt",
    "ER lipid enzymes 9 genes.txt",
    "Flipases 7 genes.txt",
    "mito lipid enzymes 2 genes.txt",
    "Vacuolar Lipid Enzymes 2 genes.txt",
    "Phospholipid binding 16 genes.txt",
    "Glucose transport 9 genes.txt"
    )

    # what is the gene universe
    #universe=read.delim(file=fn.scatterInput, header=T)
    #universe$Systematic=gsub("[.]S1", "", universe$Systematic)# replaces .S1 with empty strings

    universe=read.delim(file=fn.scatterInput, header=T)
    universe$Systematic=gsub("[.]S1", "", universe$Systematic)# replaces .S1 with empty strings

    # select the columns
    yaxis="Erg6c.WTc"
    #yaxis="Our.WTc.Our.WTy"
    #yaxis="WTy"
    #yaxis = colnames(universe)[4]#sets the Y axis to the name of the last coumn
    #yaxis = colnames(universe)[ncol(universe)]#sets the Y axis to the name of the last coumn
    #yaxis=ncol(universe)
    #yaxis="Ag15c.Atg15yR1"
    yratio=1.5
    #xaxis=colnames(universe)[5]
    xaxis="Erg6y.WTy"
    #xaxis="WT.CR.2.WT.2"
    #xaxis="X2.G.12.h"

    xratio=1.5
    logarithm = TRUE # for the ratio the logarithm should be set to TRUE
    enrichment = TRUE
    my.legend = TRUE
    Brown = FALSE
    no.black = FALSE # if set to TRUE it does not do enrichment for the chosen genes. True means it will do everything, except for the chosen genes
    black.only = FALSE # put it to FALSE if you want all genes
    #Black = TRUE# it does enrichment only for the black genes (put it to FALSE to do it for everybody). False means that it enriches the over and under expressed but not the chosen

    if (enrichment == TRUE){
    library(GOstats)
    library(org.Sc.sgd.db)
    }

    exclusion.quadrant = FALSE
    xaxis.limit = 7.61
    yaxis.limit = 7.61
    exclusion.color = "violet"
    ################################
    number.of.lines = 5

    chosen.colors = c("cornflowerblue", "gray53", "seagreen4", "blue", "deeppink3", "saddlebrown", "red", "black", "blueviolet", "darkblue")
    chosen.labels = c("cornflowerblue", "gray53", "seagreen4", "blue", "deeppink3", "saddlebrown", "red", "black", "blueviolet", "darkblue")
    my.directory = "2015_02_26_Erg6"
    #####################################################################
    ### choose colors for the points
    above.line.color = "yellow"
    below.line.color = "cyan"
    handpicked.color = "saddlebrown"
    middle.genes.color = "moccasin"

    above.line.label = "red"
    below.line.label = "blue"
    handpicked.label = "burlywood4"
    middle.genes.label = "navajowhite4"

    ### choose colors for the lines
    upper.diagonal.color = "yellow"
    lower.diagonal.color = "yellow"
    left.vertical.line = "yellow"
    right.vertical.line = "yellow"
    upper.horizontal.line = "yellow"
    lower.horizontal.line = "yellow"

    # define the shapes
    shape.above.line = 17
    shape.below.line = 15

    legend.size = 0.4# font size
    point.size = 0.4# size of points in scatter plot
    show.above.names = "TRUE" #you can change it to FALSE
    show.below.names = "TRUE"
    show.middle.names = "FALSE"
    show.chosen = "TRUE"

    # define parameters for gene ontology enrichment
    hgCutoff <- 0.05
    FDR.thresh=0.1

    text.cex=0.25 # This number tells the font size of the selected genes.
    #######################################################################
    #######################################################################
    names=as.character(universe$Common)
    nocommon=which(names=="—")
    names[nocommon]=as.character(universe$Systematic[nocommon])
    # define the chosen genes and also the colors for the up and down regulated
    # read in the file with chosen genes from a txt file
    for ( i in 1:length(chosen.input.files)){
    temp = vector()
    c.name = paste("chosen", i, sep = ".")
    assign(c.name, temp)
    assign(c.name, toupper(as.character(unlist(read.delim(file=paste(chosen.folder, "/",chosen.input.files[i], sep = ""), header=F)))))

    temp2 = get(c.name)
    assign(c.name, temp2[which((temp2 %in% universe$Systematic | temp2 %in% universe$Common) )])

    # assign colors for points and labels for the chosen genes
    col.name = paste("chosen", i, "color", sep = ".")
    lab.name = paste("chosen", i, "label", sep = ".")

    assign(col.name, chosen.colors[i])
    assign(lab.name, chosen.labels[i])

    assign( paste("chosen", i, "up.color", sep = "."), chosen.colors[i])
    assign( paste("chosen", i, "down.color", sep = "."), chosen.colors[i])

    assign(paste("chosen", i, "up.label", sep = "."), chosen.labels[i])
    assign(paste("chosen", i, "down.label", sep = "."), chosen.labels[i])
    }

    dir.create(my.directory, showWarnings=F)
    directory=paste(my.directory, paste(yaxis, "vs", xaxis, "FC", yratio, "FC", xratio, "cutoff", hgCutoff, sep="_" ), sep="/")
    dir.create(directory, showWarnings=F)

    if (logarithm == TRUE){

    x.all = log2(universe[,xaxis])
    y.all = log2(universe[,yaxis])
    labs.pattern = "log2"
    intercept = 2 * log2(yratio)
    b1 = "("
    b2 = ")"

    } else {

    x.all = universe[,xaxis]
    y.all = universe[,yaxis]
    labs.pattern = b1 = b2 = ""
    intercept = log2(yratio)
    #intercept = (yratio)
    }

    keep1= x.all y.all + intercept

    ### plot the whole thing at once

    shape.all = rep(19, length(x.all))
    shape.all[keep1] = 17
    shape.all[keep2] = 15

    color.all = rep(middle.genes.color, length(x.all))
    color.all[keep1] = above.line.color
    color.all[keep2] = below.line.color

    # plot all the points
    if(logarithm==TRUE){

    plot(x.all, y.all, xlab=paste(labs.pattern,b1 ,xaxis,b2, sep=””), ylab=paste(labs.pattern,b1,yaxis,b2, sep=””),pch=shape.all, col=color.all, cex = point.size)#, xlim=c(0,log2(max(universe[,xaxis]))), ylim=c(0,log2(max(universe[,yaxis]))))
    }else{

    plot(x.all, y.all, xlab=paste(labs.pattern,b1 ,xaxis,b2, sep=””), ylab=paste(labs.pattern,b1,yaxis,b2, sep=””),pch=shape.all, col=color.all, cex = point.size)#, xlim=c(0,max(universe[,xaxis])), ylim=c(0,max(universe[,yaxis])))
    }

    xr=range(x.all)
    #xr[1]=0
    axis(1, at=round(xr[1]):round(xr[2]), labels=round(xr[1]):round(xr[2]))
    yr=range(y.all)
    #yr[1]=0
    axis(2, at=round(yr[1]):round(yr[2]), labels=round(yr[1]):round(yr[2]))
    text(round(xr[1])+2, round(yr[2])-1,paste(“r = “,round(cor.test(x.all, y.all)$estimate, digit=2), sep=””))

    abline(0,1, col=”tan”)
    abline(intercept, 1 , col=upper.diagonal.color, lwd=2)
    abline(-intercept,1 , col=lower.diagonal.color, lwd=2)

    if (logarithm == TRUE){

    abline(h=log2(yratio), col=upper.horizontal.line, lwd=2)
    abline(h=log2(1/yratio), col=lower.horizontal.line, lwd=2)
    abline(v=log2(xratio), col=right.vertical.line, lwd=2)
    abline(v=log2(1/xratio), col=left.vertical.line, lwd=2)

    }

    points(x.all, y.all,pch=shape.all, col=color.all, cex = point.size)

    if (show.above.names == “TRUE”){
    text(x = x.all[ keep1 ], y = y.all[ keep1 ], label = names[ keep1 ], col = above.line.label, cex = text.cex, pos = 4, offset = 0.25)
    }
    if (show.below.names == “TRUE”){
    text(x = x.all[ keep2 ], y = y.all[ keep2 ], label = names[ keep2 ], col = below.line.label, cex = text.cex, pos = 4, offset = 0.25)
    }
    if (show.middle.names == “TRUE”){
    text(x = x.all[ !(keep2 | keep1) ], y = y.all[!(keep2 | keep1) ], label = names[ !(keep2 | keep1)], col = middle.genes.label, cex = text.cex, pos = 4, offset = 0.25)
    }

    if (exclusion.quadrant == TRUE){
    rect(0, yaxis.limit, xaxis.limit, yaxis.limit, border = exclusion.color)
    rect(xaxis.limit, 0 , xaxis.limit, yaxis.limit, border = exclusion.color)
    }

    if (logarithm == TRUE){
    title(main=paste(paste(yaxis, “vs.”,xaxis),paste(“lines: log2(“,round(1/yratio, digit=2),”) and log2(“, yratio,”)”, sep=””),paste(“lines: log2(“,round(1/xratio, digit=2),”) and log2(“, xratio,”)”, sep=””), sep=”\n”))
    } else {
    title(main=paste(yaxis, “vs.”,xaxis))
    }
    if (Brown==TRUE){
    index=identify(x=x.all, y=y.all, label=””, cex=text.cex, col=handpicked.color, pos = TRUE)
    points(x.all[index$ind], y.all[index$ind], col=handpicked.color, cex = point.size, pch = shape.all[index$ind])
    text(x.all[index$ind], y.all[index$ind], col=handpicked.label, label = names[index$ind], cex = text.cex, pos = 4, offset = 0.25)

    }

    # get the positions of the chosen genes and depict them on the plot (do this also for genes above and below the line)
    ###################################
    message(” Set chosen colors”)
    for (i in 1:length(chosen.input.files)){

    # get the labels and colors
    col1 = get(paste(“chosen”, i, “color”, sep=”.”))
    lab1 = get(paste(“chosen”, i, “label”, sep=”.”))

    # get the overlap
    overlap=which(universe$Common %in% get(paste(“chosen”, i, sep=”.”)) | universe$Systematic %in% get(paste(“chosen”, i, sep=”.”)))
    if (length(get(paste(“chosen”, i, sep=”.”)))){
    points(x.all[overlap], y.all[overlap], col=col1, pch=shape.all[overlap], cex = point.size)
    if(show.chosen == “TRUE”){
    text(x.all[overlap], y.all[overlap], labels= names[overlap], pos=4, col=lab1, cex=text.cex, offset = 0.25)
    }
    ratio.chosen = log2(2^y.all[overlap]/2^x.all[overlap])
    names(ratio.chosen) = names[overlap]
    assign(paste(“chosen.up”, i, sep=”.”),names[overlap[ratio.chosen>=intercept]])
    assign(paste(“chosen.down”, i, sep=”.”),names[overlap[ratio.chosen0){
    write.table(as.data.frame(dat1), file=outfile, sep=”\t”, quote=F, row.names=F, col.names=F)
    write.table(as.data.frame(universe[universe$Common %in% as.character(unique(dat1)) | universe$Systematic %in% as.character(unique(dat1)),”Systematic”]), file=outfile2, sep=”\t”, quote=F, row.names=F, col.names=F)
    }
    }
    }

    if(my.legend == TRUE){
    legend(“bottomright”, bty = “n”, fill = chosen.colors, legend = chosen.input.files, cex = legend.size, pt.cex = legend.size)
    }

    # write out the names of the different points

    if(Brown==TRUE){
    index.names=universe[index$ind,c(“Common”, “Systematic”)]
    index.names$Common=as.character(index.names$Common)
    index.names$Common[index.names$Common==”—“]=as.character(index.names$Systematic[index.names$Common==”—“])
    if (nrow(index.names)>0){
    write.table(as.data.frame(index.names[,1]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_handpicked_genes_”, handpicked.color, “.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
    # write out systematic names
    write.table(as.data.frame(index.names[,2]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_handpicked_genes_”, handpicked.color, “systematic.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
    }
    }

    keep1.names=universe[keep1,c(“Common”, “Systematic”)]
    keep1.names$Common=as.character(keep1.names$Common)
    keep1.names$Common[keep1.names$Common==”—“]=as.character(keep1.names$Systematic[keep1.names$Common==”—“])
    if (nrow(keep1.names)>0){
    write.table(as.data.frame(keep1.names[,1]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, above.line.color, “_dots.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
    write.table(as.data.frame(keep1.names[,2]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, above.line.color, “_dots_systematic.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
    }
    keep2.names=universe[keep2,c(“Common”, “Systematic”)]
    keep2.names$Common=as.character(keep2.names$Common)
    keep2.names$Common[keep2.names$Common==”—“]=as.character(keep2.names$Systematic[keep2.names$Common==”—“])
    if (nrow(keep2.names)>0){
    write.table(as.data.frame(keep2.names[,1]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, below.line.color, “_dots.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
    write.table(as.data.frame(keep2.names[,2]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, below.line.color, “_dots_systematic.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
    }

    keep3.names=rbind(keep1.names, keep2.names)
    if (nrow(keep3.names)>0){
    write.table(as.data.frame(keep3.names[,1]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, above.line.color, “_and_”, below.line.color, “_dots.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
    write.table(as.data.frame(keep3.names[,2]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, above.line.color, “_and_”, below.line.color, “_dots_systematic.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
    }

    #display the numbers of upregulated and downregulated genes
    total <- dim(universe)[1]
    nup <- dim(keep1.names)[1]
    ndown <- dim(keep2.names)[1]
    text(x=-5, y=5, labels = paste("up",nup, "genes\n",
    round(nup/total, digits = 4)*100, "%", sep = " "), cex = 0.7)
    text(x=3, y=-4, labels = paste("down",ndown, "genes\n",
    round(ndown/total, digits = 4)*100, "%", sep = " "), cex = 0.7)
    text(x=-4, y=-4, labels = paste("diff",ndown+nup, "genes\n",
    round((ndown+nup)/total, digits = 4)*100, "%", sep = " "), cex = 0.7)

    my.plot=recordPlot()
    pdf(file=paste(directory, "/", yaxis, "_vs_", xaxis,"_plot.pdf", sep=""), height = 6.5, width = 6.5)
    replayPlot(my.plot)
    dev.off()

    if (enrichment == TRUE){
    message(" Calculate enrichment…")
    # do the enrichment analysis for the three groups of genes
    condition1=yaxis
    condition2=xaxis
    fold.change1=yratio
    fold.change2=xratio

    my.chosen.objects = grep("syst", grep("(color|label|files)", grep("^chosen", objects(), value = T), value = T, invert = T), invert = T, value = T)

    for (obj in my.chosen.objects){
    assign(paste(obj, "syst", sep = "."), as.character(universe[universe$Systematic %in% get(obj) | universe$Common %in% get(obj),c("Systematic")]))
    }

    reduced.data = universe[,c("Systematic", xaxis, yaxis)]
    rownames(reduced.data) = universe$Systematic
    above.line.syst=as.character(keep1.names$Systematic)
    if (exclusion.quadrant == TRUE){
    above.filter.1 = reduced.data[above.line.syst, c(xaxis)]< xaxis.limit
    above.filter.2 = reduced.data[above.line.syst, yaxis]0)>0){
    above.line.syst = above.line.syst[above.filter == 0]
    }
    }

    below.line.syst=as.character(keep2.names$Systematic)
    if (exclusion.quadrant == TRUE){
    below.filter.1 = reduced.data[below.line.syst, c(xaxis)]< xaxis.limit
    below.filter.2 = reduced.data[below.line.syst, yaxis]0)>0){
    below.line.syst = below.line.syst[below.filter == 0]
    }
    }

    above.line.and.below.line.syst=c(above.line.syst, below.line.syst)
    if (Brown==TRUE){
    handpicked.syst=as.character(index.names$Systematic)
    }
    #ontology=”BP” #change ontology here (B.P, M.F, C.C)
    ####################################################
    ####################################################
    ####################################################
    if (Brown == TRUE){
    all.gene.groups = c(“above.line.syst”, “below.line.syst”, “above.line.and.below.line.syst”, “handpicked.syst”)
    } else {
    all.gene.groups = c(“above.line.syst”, “below.line.syst”, “above.line.and.below.line.syst”)
    }

    # if (Black.only == TRUE){
    # all.gene.groups = c(“chosen.1.syst”, “chosen.2.syst”)
    # }
    if (no.black == TRUE){
    all.gene.groups = all.gene.groups
    } else if (black.only == TRUE) {
    all.gene.groups = grep(“chosen.*syst”, objects(), value = T)
    } else {
    all.gene.groups = c(all.gene.groups,grep(“chosen.*syst”, objects(), value = T))
    }
    # what are the GO enrichments
    message(” What are the GO enrichments”)
    for (g in all.gene.groups){
    #for (ontology in c(“MF”, “BP”, “CC”)){#Delete whatever component I don’t want
    for (ontology in c(“MF”, “BP”, “CC”)){
    # what are the selected genes?
    universeIds=as.character(universe$Systematic)
    selectedGenes=get(g)

    if(length(selectedGenes) == 0){ next }

    params <- new("GOHyperGParams",
    geneIds=selectedGenes,
    universeGeneIds=universeIds,
    annotation="org.Sc.sgd.db",
    ontology=ontology,
    pvalueCutoff=hgCutoff,
    conditional=TRUE,
    testDirection="over")

    #you have to change ontology="CC"
    #to ontology="M.F" or "C.C" or "B.P"

    # make a hypergeometric test for overrepresentation
    hgOver=hyperGTest(params)

    ### retrieve genes corresponding to each GO ID
    # (GO ID, GO description, p-value)
    summaryHgOver=summary(hgOver)
    out <- summaryHgOver[,c(1,2,7)]
    # retrieve input genes associated with each GO identifier
    # use the org.Sc data mapping to get GO terms for each ID
    column=paste("GO", ontology, "ID", sep="")
    goMaps <- lapply(out[,column], function(x) unlist(mget(x, org.Sc.sgdGO2ALLORFS)))
    # it assigns each gene, which surrived the filtering, to all GO terms, which are enriched
    goSelected <- lapply(goMaps, function(x) selectedGenes[selectedGenes %in% x])

    # append a column to the table
    summaryHgOver[,ncol(summaryHgOver)+1]=do.call(rbind, lapply(goSelected, function(x) paste(unique(x), collapse=";")))

    # map yeast ORFs to common names

    common = lapply(goSelected, function(x){ y = sapply(x, function(z){ zz = as.character(universe[(universe$Systematic %in% z), "Common"] )})})
    commonNames=lapply(common, function(x){paste(unique(unlist(strsplit(unlist(x),", "))), collapse=";")})
    summaryHgOver[,ncol(summaryHgOver)+1]=do.call(rbind, commonNames)
    colnames(summaryHgOver)[(ncol(summaryHgOver)-1):ncol(summaryHgOver)] = c("Systematic", "Common")

    # write out a table
    gg=gsub("\\.syst", "", g)
    gg1 = gsub("\\.$", "", gsub("^\\.", "", unlist(strsplit(gg, "and"))))
    if (g == "handpicked.syst") {gg1 = "handpicked"}
    gene.colors = paste(sapply(paste(gg1, "color", sep="."), get), collapse = "_and_")

    ###########################
    if (grepl("chosen", gg)){
    ch.object = unlist(strsplit(gg1,"\\."))
    num = as.numeric(ch.object[length(ch.object)])
    ch.label = as.character(ch.object[length(ch.object)-1])
    ch.col = get(paste("chosen", num, "color", sep="."))
    ch.file = chosen.input.files[num]
    if (ch.label == "chosen"){
    ch.label = ""
    }
    gene.colors = paste(toupper(ch.label), "chosen_from", gsub("\\.(txt|tsv)","", ch.file), sep="_")

    }
    ########################################
    if (g == "handpicked.syst") { gene.colors = paste ("handpicked", gene.colors,sep ="_" )}
    filename=paste(directory, "/", condition1, "_FC", fold.change1,"_vs_", condition2, "_FC", fold.change2,"_", ontology, "_",hgCutoff,"_", gene.colors, "_genes.tsv", sep="")
    #print(filename)

    ## EDIT (Feb 20, 2015): Adding FDR correction
    ## patrick@jimmy.harvard.edu
    summaryHgOver <- cbind(summaryHgOver, FDR=p.adjust(summaryHgOver$Pvalue))
    dir.create(file.path(directory, "Full"), showWarnings=F, recursive=T)
    dir.create(file.path(directory, "Filtered"), showWarnings=F, recursive=T)
    dir.create(file.path(directory, "Full_with_less_columns"), showWarnings=F, recursive=T)
    dir.create(file.path(directory, "Filtered_with_less_columns"), showWarnings=F, recursive=T)
    filename.full=file.path(dirname(filename), "Full", basename(filename))
    filename.filtered=file.path(dirname(filename), "Filtered", gsub("\\.tsv$", ".filtered.tsv", basename(filename)))
    filename.full.less=file.path(dirname(filename), "Full_with_less_columns", basename(filename))
    filename.filtered.less=file.path(dirname(filename), "Filtered_with_less_columns", gsub("\\.tsv$", ".filtered.tsv", basename(filename)))
    write.table(summaryHgOver, file=filename.full, sep="\t", quote=F, row.names=F, col.names=T)
    write.table(summaryHgOver[,-c(3,8,9)], file=filename.full.less, sep="\t", quote=F, row.names=F, col.names=T)
    summaryHgOver.filtered <- summaryHgOver[summaryHgOver$FDR < FDR.thresh, , drop=F]
    write.table(summaryHgOver.filtered, file=filename.filtered, sep="\t", quote=F, row.names=F, col.names=T)
    write.table(summaryHgOver.filtered[,-c(3,8,9)], file=filename.filtered.less, sep="\t", quote=F, row.names=F, col.names=T)
    }
    }

    # concatenate the enrichment files
    message("Concatenate enrichment files…")

    search.directory = directory
    ####===================================================
    files = list.files(search.directory, pattern = "(MF|BP|CC).*genes.tsv", full.names=T)
    #files = list.files(search.directory, pattern = "BP.*genes.tsv", full.names=T)
    #############===================================================

    #####################################

    file.list = list()
    for (f in files){
    file.list[[f]] = read.delim(f, header = F)
    }

    names(file.list) = gsub("^.*\\/","", names(file.list))

    new.list = list()
    count = 1
    for (i in 1:length(file.list)){
    d = t(data.frame(c(names(file.list)[i], rep("", 4))))
    colnames(d) = paste("V", 1:5, sep="")
    if(nrow(file.list[[i]])=number.of.lines){
    new2 = rbind(new[1:(number.of.lines+2),], empty)
    } else {
    new2 = rbind(new, empty)
    }
    new.list[[i]] = new2
    }

    write.table(do.call(rbind, new.list), file=paste(search.directory, “/GO.enrichments.all.tables.”, directory, “.txt”, sep=””), row.names=F, col.names=F, sep=”\t”, quote = F)
    }
    message(” Done.”)

    ## Additional: February 20, 2015, patrick@jimmy.harvard.edu
    ##

    1. Hi Thomas,

      I’m not an expert but I’ll have a look through your code on the weekend and see if I can answer your question.

      Cheers,

      Dave

      1. Dear Dave,

        How are you? I thank you so very much for kindly reaching out to me by offering your assistance in helping me earning a PhD in bioinformatics despite being almost blind. I am glad you responded to my post. Since my poor vision makes it hard for me to read and write I would prefer discussing details with you by voice. Do you have Skype? If so, what is your Skype ID? Mine is tfh002. Could you please send me a Skype invite? When would be a good time for us to talk? Where are you located?

        I kind of need to have my thesis completed within the next 2 weeks. I already have R scripts. I would need help in having the software summarizing the results for me. It is too much for me to compare all relevant GO (Gene Ontology) terms visually. I also need help in comparing my findings with other publications. There are so many ways to analyze data. I am not sure which approach to use. My R script is 600 lines long and produces almost 200 enrichment files with every single run. I am feeling completely overwhelmed with data.

        Could you please respond to me directly via email Skype or phone because my text to speech software has difficulties reading out this website aloud to me. I can communicate much better in the much more familiar interface of G-mail and Skype.

        I am very much looking forward to the opportunity to talk to you by voice soon.

        I thank you very much in advance for your assistance, time, help, efforts and support.

        With very warm regards,

        Thomas Hahn
        Cell phone: 318 243 3940
        Skype ID: tfh002
        Email: Thomas.F.Hahn2@gmail.com

  3. Hi Dave,
    Excellent post.
    I don’t see where the “x” object came from when you convert to a list.

    Thank you.

    1. Thanks for spotting the typo! It’s supposed to be


      entrez_to_go <- as.list(entrez_object[mapped_genes])

  4. Hi Dave,

    Could you please explain, where did the “ËntrezGene” come from prior to line 19, the section where you explain on how to convert into EntrezID?

    Cheers!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.