Updated: 2019 March 24th
The Gene Ontology Enrichment Analysis is a popular type of analysis that is carried out after a differential gene expression analysis has been carried out. There are many tools available for performing a gene ontology enrichment analysis. Online tools include DAVID, PANTHER and GOrilla. Bioconductor pacakges include GOstats, topGO and goseq.
In this post, I'll go through the GOstats package. Firstly download and install some packages that will be used in this post.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("GOstats") # this should be installed when you install GOstats # install GO.db too, which is # A set of annotation maps describing the entire Gene Ontology BiocManager::install("GO.db") # Genome wide annotation for Human, primarily based on mapping using Entrez Gene identifiers. BiocManager::install("org.Hs.eg.db")
Now let's create a toy example where my test set of genes are those that are associated with the GO term GO:0007411 (axon guidance). We can use the org.Hs.eg.db package to fetch all genes associated with axon guidance.
library(org.Hs.eg.db) # for reference's sake # how to get GO terms from an Entrez gene id following the manual # http://bioconductor.org/packages/release/data/annotation/manuals/org.Hs.eg.db/man/org.Hs.eg.db.pdf # org.Hs.egGO is an R object that provides # mappings between Entrez gene identifers and the GO # identifers that they are directly associated with entrez_object <- org.Hs.egGO class(entrez_object) [1] "Go3AnnDbBimap" attr(,"package") [1] "AnnotationDbi" # Get the Entrez gene IDs that are mapped to a GO ID mapped_genes <- mappedkeys(entrez_object) # Entrez gene IDs head(mapped_genes) [1] "1" "10" "100" "1000" "10000" "10001" # map GO IDs to Entrez IDs and convert to a list entrez_to_go <- as.list(entrez_object[mapped_genes]) length(entrez_to_go) [1] 20408 # http://www.ncbi.nlm.nih.gov/gene/?term=1 # entrez gene id 1 is A1BG # below are the GO terms associated with the gene sapply(entrez_to_go[[1]], function(x) x) GO:0002576 GO:0008150 GO:0043312 GO:0005576 GO:0005576 GO:0005576 GO:0005615 GOID "GO:0002576" "GO:0008150" "GO:0043312" "GO:0005576" "GO:0005576" "GO:0005576" "GO:0005615" Evidence "TAS" "ND" "TAS" "HDA" "IDA" "TAS" "HDA" Ontology "BP" "BP" "BP" "CC" "CC" "CC" "CC" GO:0031093 GO:0034774 GO:0062023 GO:0070062 GO:0072562 GO:1904813 GO:0003674 GOID "GO:0031093" "GO:0034774" "GO:0062023" "GO:0070062" "GO:0072562" "GO:1904813" "GO:0003674" Evidence "TAS" "TAS" "HDA" "HDA" "HDA" "TAS" "ND" Ontology "CC" "CC" "CC" "CC" "CC" "CC" "MF" # perform the reverse and map GO terms to Entrez gene ids go_object <- as.list(org.Hs.egGO2EG) length(go_object) [1] 17983 # GO:0000002 -> mitochondrial genome maintenance # entrez gene id 291 -> SLC25A4 # which if you look up has the GO term mitochondrial genome maintenance go_object[1] $`GO:0000002` TAS IMP ISS IMP IMP NAS IMP IBA IDA IEA IDA IMP "291" "1890" "4205" "4358" "4976" "9361" "10000" "55154" "55186" "80119" "84275" "92667" # now let's get all the entrez gene ids for # GO:0007411 (axon guidance) axon_gene <- go_object['GO:0007411'] length(unlist(axon_gene, use.names=FALSE)) [1] 201 # not all genes are unique length(unique(unlist(axon_gene, use.names=FALSE))) [1] 191 # use only unique genes axon_gene <- unique(unlist(axon_gene, use.names=FALSE)) head(axon_gene) [1] "323" "474" "627" "655" "682" "1002"
Now let's perform the enrichment analysis; note that I use all genes with GO terms as the universe list. This list should be set to the genes that were actually assayed in your set of experiments. In addition, I am only testing for biological process terms that are over-represented and selecting terms that have a p-value of 0.001 or less.
library("GO.db") library("GOstats") # I will use all the genes with GO terms as the universe universe <- mapped_genes length(universe) [1] 20408 length(axon_gene) [1] 191 params <- new('GOHyperGParams', geneIds = axon_gene, universeGeneIds = universe, ontology = 'BP', pvalueCutoff = 0.001, conditional = FALSE, testDirection = 'over', annotation = "org.Hs.eg.db" ) hgOver <- hyperGTest(params) hgOver Gene to GO BP test for over-representation 4617 GO BP ids tested (1045 have p < 0.001) Selected gene set size: 191 Gene universe size: 18493 Annotation package: org.Hs.eg result <- summary(hgOver) GOBPID Pvalue OddsRatio ExpCount Count Size 1 GO:0007409 0.000000e+00 Inf 4.637376 191 449 2 GO:0007411 0.000000e+00 Inf 2.675012 191 259 3 GO:0097485 0.000000e+00 Inf 2.685340 191 260 4 GO:0061564 1.572265e-319 Inf 5.091819 191 493 5 GO:0048667 1.383219e-306 Inf 5.773482 191 559 6 GO:0006935 2.278322e-296 Inf 6.393176 191 619 Term 1 axonogenesis 2 axon guidance 3 neuron projection guidance 4 axon development 5 cell morphogenesis involved in neuron differentiation 6 chemotaxis
As expected, GO terms associated to axons are enriched. Note that the Size for GO:0007411 is 259 and not 191; we had selected all genes that were associated with GO:0007411 (axon guidance), so I was expecting the Count and the Size to be identical. Perhaps we didn't select all genes associated with axon guidance.
What if my gene list are not Entrez gene ids?
For this example, I will demonstrate how to convert Ensembl gene IDs to Entrez gene IDs using biomaRt.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("biomaRt", version = "3.8") library("biomaRt") ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl") my_chr <- c(1:22, 'M', 'X', 'Y') my_ensembl_gene <- getBM(attributes='ensembl_gene_id', filters = 'chromosome_name', values = my_chr, mart = ensembl) dim(my_ensembl_gene) [1] 60587 1 five_ensembl <- my_ensembl_gene[1:5,] five_ensembl_to_entrez <- getBM(attributes=c('ensembl_gene_id', 'entrezgene_id'), filters = 'ensembl_gene_id', values = five_ensembl, mart = ensembl) # not all Ensembl gene IDs have Entrez IDs five_ensembl_to_entrez ensembl_gene_id entrezgene_id 1 ENSG00000206779 NA 2 ENSG00000207256 NA 3 ENSG00000221643 NA 4 ENSG00000252151 NA 5 ENSG00000252682 NA # out of interest how many entrez gene ids? my_entrez_gene <- getBM(attributes='entrezgene_id', filters = 'chromosome_name', values = my_chr, mart = ensembl) dim(my_entrez_gene) [1] 19399 1 # get some more info on the entrez_gene my_attribute <- c('entrezgene_id', 'hgnc_symbol', 'chromosome_name', 'start_position', 'end_position', 'strand') my_entrez_gene_info <- getBM(attributes=my_attribute, filters = c('entrezgene_id', 'chromosome_name'), values = list(entrezgene=my_entrez_gene$entrezgene_id, chromosome_name=my_chr), mart = ensembl) head(my_entrez_gene_info) entrezgene_id hgnc_symbol chromosome_name start_position end_position strand 1 23612 PHLDA3 1 201464278 201469237 -1 2 4179 CD46 1 207752054 207795513 1 3 9909 DENND4B 1 153929501 153946718 -1 4 57459 GATAD2B 1 153789030 153923360 -1 5 54499 TMCO1 1 165724293 165827755 -1 6 7371 UCK2 1 165827614 165911618 1 dim(my_entrez_gene_info) [1] 19495 6
To learn more about the missing Entrez ID values from the Ensembl conversion see this useful post on BioStars and my post on converting gene identifiers.
Further reading
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Nice write up for GOstats, but did you actually use topGO anywhere here?
When I created the post, the aim was to test topGO as well. As you’ve spotted, I didn’t actually use it.
Perhaps for the next update.
Hi!
Why do you use the background (universe) as the setdiff between your axon genes and all genes? I thought the right way to do it was to use all the genes annotated.
Are you just trying to amp up your p-values, a.k.a, significance of results obtained (by using setdiff) for demo sake?
Hi Shrutii,
yes you’re right; the hypergeometric-based tests requires all genes as the gene universe.
At one stage, I was comparing the results when using different gene universes. I’ll modify the post to state this clearly.
Thanks for the comment!
Cheers,
Dave
Hi Dave,
Very well exlained and informative post.
Keep Posting!
How to compare and categorize enrichment CO terms from multiple files?
I need to find a way to substitute more specific child GO terms with less specific parent GO terms since my remaining vision is not good enough for comparing them by visually looking at the enriched GO terms. I have copy-pasted my latest version of the R code I am currently using at the end of this email. Could anyone please assist me in accomplishing this or connect me to somebody, who might? I only have 4 weeks left to finish my thesis. If I cannot defend by the end of March my funding will be discontinued.
Could you please respond to me directly via email (Thomas.F.Hahn2@gmail.com) or Skype (tfh002) or cell phone (318 243 3940) because my text to speech software, on which I am depending to access electronic information due to my very limited remaining vision, cannot read out all portions of this website aloud to me? Fortunately, it is much more compatible with the G-mail and Skype interface thus allowing me to communicate much more efficiently and reliably when using those.
I am very grateful for any kind of assistance in earning a PhD in bioinformatics despite being almost blind.
With very warm regards,
Thomas Hahn
Cell phone: 318 243 3940
Skype ID: tfh002
Email: Thomas.F.Hahn2@gmail.com
#################################################################################################
###
### This script combines the two subscripts that preparation and plotting of the log-fold
### yeast expression changes under CR and YEPD
###
### Additional (minor) changes:
### –
###
### Feb 20, 2015
###
#################################################################################################
message(“Clean up workspace”)
rm(list=ls(all=T))
#————————————————————————————————
### Generate scatter plot input file from:
### 1. Annotation file from Expression Lab (Affy)
### 2. Expression file from Expression Lab (Affy)
### Creates:
### 1. *.for.scatterplot.txt file in the folder in the current working directory
message(“Create Scatterplot input files…”)
setwd(“D:/Analysis”)
## INPUT FILES (relative or full path)
file=”Our_data_from_expression_console.txt” # file with data from expression console
annotation = read.delim(“Our_data_annotation.txt”, header = F) # annotation for replicates
## OUTPUT directory
my.directory = dir.scatter=”2015_02_26_Erg6″
chosen.folder = “chosen_genes”
#######################################################
replicates = tapply(annotation[,1], annotation[,2], list)
normdata = read.delim(file, check.names = F)
####Inserted by Patrick on Thursday, February 26th to set the minimum threshold to 7.6
adaptLowerBound <- TRUE # set to FALSE if no adaption so take place !!!
if (adaptLowerBound) {
tmp.colIndices <- 2:7 # column indices of expression values !!!
lowerBound <- 7.61 # hard-coded lower boundary of expression values !!!
normdata.orig <- normdata
normdata.sub <- normdata[,tmp.colIndices]
normdata.sub[which(normdata.sub <= lowerBound, arr.ind=T)] <- lowerBound
normdata[,tmp.colIndices] 1))
# remove the ones with multiple probes and leave only the conditions and ratios
final.normdata=normdata[!(as.character(normdata$Systematic) %in% multiple), c(“Probesets”,”Systematic”,”Common”,names(replicates), write.ratios)]
pombe = grep(“^SP”,as.character(final.normdata$Systematic))
affy = grep(“^AF”, as.character(final.normdata$Systematic))
remove = c(pombe, affy)
if (length(remove)>0){
final.normdata = final.normdata[-remove,]
}
normfile=gsub(“.txt”, “.for.scatterplot.txt”, file)
dir.create(dir.scatter, showWarnings=F)
fn.scatterInput <- file.path(dir.scatter, normfile)
write.table(final.normdata, fn.scatterInput, sep="\t", row.names=FALSE, col.names=TRUE, quote=F)
message(" Done.")
#————————————————————————————————
### Generate scatter plot from:
### 1. *.for.scatterplot.txt file in the "Scatterplot_out" folder in the current working directory (not required to be specified)
### 2. *.txt list of genes
### Creates:
### 1. *.for.scatterplot.txt file in the "Scatterplot_out" folder in the current working directory
message("Create Scatterplot…")
#setwd("D:/Tang/Patrick/Friday_Feb20_2015")
## INPUT FILES
# read in the file with chosen genes from a txt file
# give the names of the chosen files, the colors and the labels
chosen.input.files = c(
"Ergosterol SGD 150 genes.txt",
"Dietary Restriction 72 genes.txt",
"Autophagy SGD 61 genes.txt",
"Sphingolipids 12 genes.txt",
"ER lipid enzymes 9 genes.txt",
"Flipases 7 genes.txt",
"mito lipid enzymes 2 genes.txt",
"Vacuolar Lipid Enzymes 2 genes.txt",
"Phospholipid binding 16 genes.txt",
"Glucose transport 9 genes.txt"
)
# what is the gene universe
#universe=read.delim(file=fn.scatterInput, header=T)
#universe$Systematic=gsub("[.]S1", "", universe$Systematic)# replaces .S1 with empty strings
universe=read.delim(file=fn.scatterInput, header=T)
universe$Systematic=gsub("[.]S1", "", universe$Systematic)# replaces .S1 with empty strings
# select the columns
yaxis="Erg6c.WTc"
#yaxis="Our.WTc.Our.WTy"
#yaxis="WTy"
#yaxis = colnames(universe)[4]#sets the Y axis to the name of the last coumn
#yaxis = colnames(universe)[ncol(universe)]#sets the Y axis to the name of the last coumn
#yaxis=ncol(universe)
#yaxis="Ag15c.Atg15yR1"
yratio=1.5
#xaxis=colnames(universe)[5]
xaxis="Erg6y.WTy"
#xaxis="WT.CR.2.WT.2"
#xaxis="X2.G.12.h"
xratio=1.5
logarithm = TRUE # for the ratio the logarithm should be set to TRUE
enrichment = TRUE
my.legend = TRUE
Brown = FALSE
no.black = FALSE # if set to TRUE it does not do enrichment for the chosen genes. True means it will do everything, except for the chosen genes
black.only = FALSE # put it to FALSE if you want all genes
#Black = TRUE# it does enrichment only for the black genes (put it to FALSE to do it for everybody). False means that it enriches the over and under expressed but not the chosen
if (enrichment == TRUE){
library(GOstats)
library(org.Sc.sgd.db)
}
exclusion.quadrant = FALSE
xaxis.limit = 7.61
yaxis.limit = 7.61
exclusion.color = "violet"
################################
number.of.lines = 5
chosen.colors = c("cornflowerblue", "gray53", "seagreen4", "blue", "deeppink3", "saddlebrown", "red", "black", "blueviolet", "darkblue")
chosen.labels = c("cornflowerblue", "gray53", "seagreen4", "blue", "deeppink3", "saddlebrown", "red", "black", "blueviolet", "darkblue")
my.directory = "2015_02_26_Erg6"
#####################################################################
### choose colors for the points
above.line.color = "yellow"
below.line.color = "cyan"
handpicked.color = "saddlebrown"
middle.genes.color = "moccasin"
above.line.label = "red"
below.line.label = "blue"
handpicked.label = "burlywood4"
middle.genes.label = "navajowhite4"
### choose colors for the lines
upper.diagonal.color = "yellow"
lower.diagonal.color = "yellow"
left.vertical.line = "yellow"
right.vertical.line = "yellow"
upper.horizontal.line = "yellow"
lower.horizontal.line = "yellow"
# define the shapes
shape.above.line = 17
shape.below.line = 15
legend.size = 0.4# font size
point.size = 0.4# size of points in scatter plot
show.above.names = "TRUE" #you can change it to FALSE
show.below.names = "TRUE"
show.middle.names = "FALSE"
show.chosen = "TRUE"
# define parameters for gene ontology enrichment
hgCutoff <- 0.05
FDR.thresh=0.1
text.cex=0.25 # This number tells the font size of the selected genes.
#######################################################################
#######################################################################
names=as.character(universe$Common)
nocommon=which(names=="—")
names[nocommon]=as.character(universe$Systematic[nocommon])
# define the chosen genes and also the colors for the up and down regulated
# read in the file with chosen genes from a txt file
for ( i in 1:length(chosen.input.files)){
temp = vector()
c.name = paste("chosen", i, sep = ".")
assign(c.name, temp)
assign(c.name, toupper(as.character(unlist(read.delim(file=paste(chosen.folder, "/",chosen.input.files[i], sep = ""), header=F)))))
temp2 = get(c.name)
assign(c.name, temp2[which((temp2 %in% universe$Systematic | temp2 %in% universe$Common) )])
# assign colors for points and labels for the chosen genes
col.name = paste("chosen", i, "color", sep = ".")
lab.name = paste("chosen", i, "label", sep = ".")
assign(col.name, chosen.colors[i])
assign(lab.name, chosen.labels[i])
assign( paste("chosen", i, "up.color", sep = "."), chosen.colors[i])
assign( paste("chosen", i, "down.color", sep = "."), chosen.colors[i])
assign(paste("chosen", i, "up.label", sep = "."), chosen.labels[i])
assign(paste("chosen", i, "down.label", sep = "."), chosen.labels[i])
}
dir.create(my.directory, showWarnings=F)
directory=paste(my.directory, paste(yaxis, "vs", xaxis, "FC", yratio, "FC", xratio, "cutoff", hgCutoff, sep="_" ), sep="/")
dir.create(directory, showWarnings=F)
if (logarithm == TRUE){
x.all = log2(universe[,xaxis])
y.all = log2(universe[,yaxis])
labs.pattern = "log2"
intercept = 2 * log2(yratio)
b1 = "("
b2 = ")"
} else {
x.all = universe[,xaxis]
y.all = universe[,yaxis]
labs.pattern = b1 = b2 = ""
intercept = log2(yratio)
#intercept = (yratio)
}
keep1= x.all y.all + intercept
### plot the whole thing at once
shape.all = rep(19, length(x.all))
shape.all[keep1] = 17
shape.all[keep2] = 15
color.all = rep(middle.genes.color, length(x.all))
color.all[keep1] = above.line.color
color.all[keep2] = below.line.color
# plot all the points
if(logarithm==TRUE){
plot(x.all, y.all, xlab=paste(labs.pattern,b1 ,xaxis,b2, sep=””), ylab=paste(labs.pattern,b1,yaxis,b2, sep=””),pch=shape.all, col=color.all, cex = point.size)#, xlim=c(0,log2(max(universe[,xaxis]))), ylim=c(0,log2(max(universe[,yaxis]))))
}else{
plot(x.all, y.all, xlab=paste(labs.pattern,b1 ,xaxis,b2, sep=””), ylab=paste(labs.pattern,b1,yaxis,b2, sep=””),pch=shape.all, col=color.all, cex = point.size)#, xlim=c(0,max(universe[,xaxis])), ylim=c(0,max(universe[,yaxis])))
}
xr=range(x.all)
#xr[1]=0
axis(1, at=round(xr[1]):round(xr[2]), labels=round(xr[1]):round(xr[2]))
yr=range(y.all)
#yr[1]=0
axis(2, at=round(yr[1]):round(yr[2]), labels=round(yr[1]):round(yr[2]))
text(round(xr[1])+2, round(yr[2])-1,paste(“r = “,round(cor.test(x.all, y.all)$estimate, digit=2), sep=””))
abline(0,1, col=”tan”)
abline(intercept, 1 , col=upper.diagonal.color, lwd=2)
abline(-intercept,1 , col=lower.diagonal.color, lwd=2)
if (logarithm == TRUE){
abline(h=log2(yratio), col=upper.horizontal.line, lwd=2)
abline(h=log2(1/yratio), col=lower.horizontal.line, lwd=2)
abline(v=log2(xratio), col=right.vertical.line, lwd=2)
abline(v=log2(1/xratio), col=left.vertical.line, lwd=2)
}
points(x.all, y.all,pch=shape.all, col=color.all, cex = point.size)
if (show.above.names == “TRUE”){
text(x = x.all[ keep1 ], y = y.all[ keep1 ], label = names[ keep1 ], col = above.line.label, cex = text.cex, pos = 4, offset = 0.25)
}
if (show.below.names == “TRUE”){
text(x = x.all[ keep2 ], y = y.all[ keep2 ], label = names[ keep2 ], col = below.line.label, cex = text.cex, pos = 4, offset = 0.25)
}
if (show.middle.names == “TRUE”){
text(x = x.all[ !(keep2 | keep1) ], y = y.all[!(keep2 | keep1) ], label = names[ !(keep2 | keep1)], col = middle.genes.label, cex = text.cex, pos = 4, offset = 0.25)
}
if (exclusion.quadrant == TRUE){
rect(0, yaxis.limit, xaxis.limit, yaxis.limit, border = exclusion.color)
rect(xaxis.limit, 0 , xaxis.limit, yaxis.limit, border = exclusion.color)
}
if (logarithm == TRUE){
title(main=paste(paste(yaxis, “vs.”,xaxis),paste(“lines: log2(“,round(1/yratio, digit=2),”) and log2(“, yratio,”)”, sep=””),paste(“lines: log2(“,round(1/xratio, digit=2),”) and log2(“, xratio,”)”, sep=””), sep=”\n”))
} else {
title(main=paste(yaxis, “vs.”,xaxis))
}
if (Brown==TRUE){
index=identify(x=x.all, y=y.all, label=””, cex=text.cex, col=handpicked.color, pos = TRUE)
points(x.all[index$ind], y.all[index$ind], col=handpicked.color, cex = point.size, pch = shape.all[index$ind])
text(x.all[index$ind], y.all[index$ind], col=handpicked.label, label = names[index$ind], cex = text.cex, pos = 4, offset = 0.25)
}
# get the positions of the chosen genes and depict them on the plot (do this also for genes above and below the line)
###################################
message(” Set chosen colors”)
for (i in 1:length(chosen.input.files)){
# get the labels and colors
col1 = get(paste(“chosen”, i, “color”, sep=”.”))
lab1 = get(paste(“chosen”, i, “label”, sep=”.”))
# get the overlap
overlap=which(universe$Common %in% get(paste(“chosen”, i, sep=”.”)) | universe$Systematic %in% get(paste(“chosen”, i, sep=”.”)))
if (length(get(paste(“chosen”, i, sep=”.”)))){
points(x.all[overlap], y.all[overlap], col=col1, pch=shape.all[overlap], cex = point.size)
if(show.chosen == “TRUE”){
text(x.all[overlap], y.all[overlap], labels= names[overlap], pos=4, col=lab1, cex=text.cex, offset = 0.25)
}
ratio.chosen = log2(2^y.all[overlap]/2^x.all[overlap])
names(ratio.chosen) = names[overlap]
assign(paste(“chosen.up”, i, sep=”.”),names[overlap[ratio.chosen>=intercept]])
assign(paste(“chosen.down”, i, sep=”.”),names[overlap[ratio.chosen0){
write.table(as.data.frame(dat1), file=outfile, sep=”\t”, quote=F, row.names=F, col.names=F)
write.table(as.data.frame(universe[universe$Common %in% as.character(unique(dat1)) | universe$Systematic %in% as.character(unique(dat1)),”Systematic”]), file=outfile2, sep=”\t”, quote=F, row.names=F, col.names=F)
}
}
}
if(my.legend == TRUE){
legend(“bottomright”, bty = “n”, fill = chosen.colors, legend = chosen.input.files, cex = legend.size, pt.cex = legend.size)
}
# write out the names of the different points
if(Brown==TRUE){
index.names=universe[index$ind,c(“Common”, “Systematic”)]
index.names$Common=as.character(index.names$Common)
index.names$Common[index.names$Common==”—“]=as.character(index.names$Systematic[index.names$Common==”—“])
if (nrow(index.names)>0){
write.table(as.data.frame(index.names[,1]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_handpicked_genes_”, handpicked.color, “.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
# write out systematic names
write.table(as.data.frame(index.names[,2]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_handpicked_genes_”, handpicked.color, “systematic.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
}
}
keep1.names=universe[keep1,c(“Common”, “Systematic”)]
keep1.names$Common=as.character(keep1.names$Common)
keep1.names$Common[keep1.names$Common==”—“]=as.character(keep1.names$Systematic[keep1.names$Common==”—“])
if (nrow(keep1.names)>0){
write.table(as.data.frame(keep1.names[,1]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, above.line.color, “_dots.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
write.table(as.data.frame(keep1.names[,2]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, above.line.color, “_dots_systematic.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
}
keep2.names=universe[keep2,c(“Common”, “Systematic”)]
keep2.names$Common=as.character(keep2.names$Common)
keep2.names$Common[keep2.names$Common==”—“]=as.character(keep2.names$Systematic[keep2.names$Common==”—“])
if (nrow(keep2.names)>0){
write.table(as.data.frame(keep2.names[,1]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, below.line.color, “_dots.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
write.table(as.data.frame(keep2.names[,2]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, below.line.color, “_dots_systematic.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
}
keep3.names=rbind(keep1.names, keep2.names)
if (nrow(keep3.names)>0){
write.table(as.data.frame(keep3.names[,1]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, above.line.color, “_and_”, below.line.color, “_dots.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
write.table(as.data.frame(keep3.names[,2]), file=paste(directory, “/”, yaxis, “_vs_”, xaxis,”_”, above.line.color, “_and_”, below.line.color, “_dots_systematic.tsv”, sep=””), sep=”\t”, quote=F, row.names=F, col.names=F)
}
#display the numbers of upregulated and downregulated genes
total <- dim(universe)[1]
nup <- dim(keep1.names)[1]
ndown <- dim(keep2.names)[1]
text(x=-5, y=5, labels = paste("up",nup, "genes\n",
round(nup/total, digits = 4)*100, "%", sep = " "), cex = 0.7)
text(x=3, y=-4, labels = paste("down",ndown, "genes\n",
round(ndown/total, digits = 4)*100, "%", sep = " "), cex = 0.7)
text(x=-4, y=-4, labels = paste("diff",ndown+nup, "genes\n",
round((ndown+nup)/total, digits = 4)*100, "%", sep = " "), cex = 0.7)
my.plot=recordPlot()
pdf(file=paste(directory, "/", yaxis, "_vs_", xaxis,"_plot.pdf", sep=""), height = 6.5, width = 6.5)
replayPlot(my.plot)
dev.off()
if (enrichment == TRUE){
message(" Calculate enrichment…")
# do the enrichment analysis for the three groups of genes
condition1=yaxis
condition2=xaxis
fold.change1=yratio
fold.change2=xratio
my.chosen.objects = grep("syst", grep("(color|label|files)", grep("^chosen", objects(), value = T), value = T, invert = T), invert = T, value = T)
for (obj in my.chosen.objects){
assign(paste(obj, "syst", sep = "."), as.character(universe[universe$Systematic %in% get(obj) | universe$Common %in% get(obj),c("Systematic")]))
}
reduced.data = universe[,c("Systematic", xaxis, yaxis)]
rownames(reduced.data) = universe$Systematic
above.line.syst=as.character(keep1.names$Systematic)
if (exclusion.quadrant == TRUE){
above.filter.1 = reduced.data[above.line.syst, c(xaxis)]< xaxis.limit
above.filter.2 = reduced.data[above.line.syst, yaxis]0)>0){
above.line.syst = above.line.syst[above.filter == 0]
}
}
below.line.syst=as.character(keep2.names$Systematic)
if (exclusion.quadrant == TRUE){
below.filter.1 = reduced.data[below.line.syst, c(xaxis)]< xaxis.limit
below.filter.2 = reduced.data[below.line.syst, yaxis]0)>0){
below.line.syst = below.line.syst[below.filter == 0]
}
}
above.line.and.below.line.syst=c(above.line.syst, below.line.syst)
if (Brown==TRUE){
handpicked.syst=as.character(index.names$Systematic)
}
#ontology=”BP” #change ontology here (B.P, M.F, C.C)
####################################################
####################################################
####################################################
if (Brown == TRUE){
all.gene.groups = c(“above.line.syst”, “below.line.syst”, “above.line.and.below.line.syst”, “handpicked.syst”)
} else {
all.gene.groups = c(“above.line.syst”, “below.line.syst”, “above.line.and.below.line.syst”)
}
# if (Black.only == TRUE){
# all.gene.groups = c(“chosen.1.syst”, “chosen.2.syst”)
# }
if (no.black == TRUE){
all.gene.groups = all.gene.groups
} else if (black.only == TRUE) {
all.gene.groups = grep(“chosen.*syst”, objects(), value = T)
} else {
all.gene.groups = c(all.gene.groups,grep(“chosen.*syst”, objects(), value = T))
}
# what are the GO enrichments
message(” What are the GO enrichments”)
for (g in all.gene.groups){
#for (ontology in c(“MF”, “BP”, “CC”)){#Delete whatever component I don’t want
for (ontology in c(“MF”, “BP”, “CC”)){
# what are the selected genes?
universeIds=as.character(universe$Systematic)
selectedGenes=get(g)
if(length(selectedGenes) == 0){ next }
params <- new("GOHyperGParams",
geneIds=selectedGenes,
universeGeneIds=universeIds,
annotation="org.Sc.sgd.db",
ontology=ontology,
pvalueCutoff=hgCutoff,
conditional=TRUE,
testDirection="over")
#you have to change ontology="CC"
#to ontology="M.F" or "C.C" or "B.P"
# make a hypergeometric test for overrepresentation
hgOver=hyperGTest(params)
### retrieve genes corresponding to each GO ID
# (GO ID, GO description, p-value)
summaryHgOver=summary(hgOver)
out <- summaryHgOver[,c(1,2,7)]
# retrieve input genes associated with each GO identifier
# use the org.Sc data mapping to get GO terms for each ID
column=paste("GO", ontology, "ID", sep="")
goMaps <- lapply(out[,column], function(x) unlist(mget(x, org.Sc.sgdGO2ALLORFS)))
# it assigns each gene, which surrived the filtering, to all GO terms, which are enriched
goSelected <- lapply(goMaps, function(x) selectedGenes[selectedGenes %in% x])
# append a column to the table
summaryHgOver[,ncol(summaryHgOver)+1]=do.call(rbind, lapply(goSelected, function(x) paste(unique(x), collapse=";")))
# map yeast ORFs to common names
common = lapply(goSelected, function(x){ y = sapply(x, function(z){ zz = as.character(universe[(universe$Systematic %in% z), "Common"] )})})
commonNames=lapply(common, function(x){paste(unique(unlist(strsplit(unlist(x),", "))), collapse=";")})
summaryHgOver[,ncol(summaryHgOver)+1]=do.call(rbind, commonNames)
colnames(summaryHgOver)[(ncol(summaryHgOver)-1):ncol(summaryHgOver)] = c("Systematic", "Common")
# write out a table
gg=gsub("\\.syst", "", g)
gg1 = gsub("\\.$", "", gsub("^\\.", "", unlist(strsplit(gg, "and"))))
if (g == "handpicked.syst") {gg1 = "handpicked"}
gene.colors = paste(sapply(paste(gg1, "color", sep="."), get), collapse = "_and_")
###########################
if (grepl("chosen", gg)){
ch.object = unlist(strsplit(gg1,"\\."))
num = as.numeric(ch.object[length(ch.object)])
ch.label = as.character(ch.object[length(ch.object)-1])
ch.col = get(paste("chosen", num, "color", sep="."))
ch.file = chosen.input.files[num]
if (ch.label == "chosen"){
ch.label = ""
}
gene.colors = paste(toupper(ch.label), "chosen_from", gsub("\\.(txt|tsv)","", ch.file), sep="_")
}
########################################
if (g == "handpicked.syst") { gene.colors = paste ("handpicked", gene.colors,sep ="_" )}
filename=paste(directory, "/", condition1, "_FC", fold.change1,"_vs_", condition2, "_FC", fold.change2,"_", ontology, "_",hgCutoff,"_", gene.colors, "_genes.tsv", sep="")
#print(filename)
## EDIT (Feb 20, 2015): Adding FDR correction
## patrick@jimmy.harvard.edu
summaryHgOver <- cbind(summaryHgOver, FDR=p.adjust(summaryHgOver$Pvalue))
dir.create(file.path(directory, "Full"), showWarnings=F, recursive=T)
dir.create(file.path(directory, "Filtered"), showWarnings=F, recursive=T)
dir.create(file.path(directory, "Full_with_less_columns"), showWarnings=F, recursive=T)
dir.create(file.path(directory, "Filtered_with_less_columns"), showWarnings=F, recursive=T)
filename.full=file.path(dirname(filename), "Full", basename(filename))
filename.filtered=file.path(dirname(filename), "Filtered", gsub("\\.tsv$", ".filtered.tsv", basename(filename)))
filename.full.less=file.path(dirname(filename), "Full_with_less_columns", basename(filename))
filename.filtered.less=file.path(dirname(filename), "Filtered_with_less_columns", gsub("\\.tsv$", ".filtered.tsv", basename(filename)))
write.table(summaryHgOver, file=filename.full, sep="\t", quote=F, row.names=F, col.names=T)
write.table(summaryHgOver[,-c(3,8,9)], file=filename.full.less, sep="\t", quote=F, row.names=F, col.names=T)
summaryHgOver.filtered <- summaryHgOver[summaryHgOver$FDR < FDR.thresh, , drop=F]
write.table(summaryHgOver.filtered, file=filename.filtered, sep="\t", quote=F, row.names=F, col.names=T)
write.table(summaryHgOver.filtered[,-c(3,8,9)], file=filename.filtered.less, sep="\t", quote=F, row.names=F, col.names=T)
}
}
# concatenate the enrichment files
message("Concatenate enrichment files…")
search.directory = directory
####===================================================
files = list.files(search.directory, pattern = "(MF|BP|CC).*genes.tsv", full.names=T)
#files = list.files(search.directory, pattern = "BP.*genes.tsv", full.names=T)
#############===================================================
#####################################
file.list = list()
for (f in files){
file.list[[f]] = read.delim(f, header = F)
}
names(file.list) = gsub("^.*\\/","", names(file.list))
new.list = list()
count = 1
for (i in 1:length(file.list)){
d = t(data.frame(c(names(file.list)[i], rep("", 4))))
colnames(d) = paste("V", 1:5, sep="")
if(nrow(file.list[[i]])=number.of.lines){
new2 = rbind(new[1:(number.of.lines+2),], empty)
} else {
new2 = rbind(new, empty)
}
new.list[[i]] = new2
}
write.table(do.call(rbind, new.list), file=paste(search.directory, “/GO.enrichments.all.tables.”, directory, “.txt”, sep=””), row.names=F, col.names=F, sep=”\t”, quote = F)
}
message(” Done.”)
## Additional: February 20, 2015, patrick@jimmy.harvard.edu
##
Hi Thomas,
I’m not an expert but I’ll have a look through your code on the weekend and see if I can answer your question.
Cheers,
Dave
Dear Dave,
How are you? I thank you so very much for kindly reaching out to me by offering your assistance in helping me earning a PhD in bioinformatics despite being almost blind. I am glad you responded to my post. Since my poor vision makes it hard for me to read and write I would prefer discussing details with you by voice. Do you have Skype? If so, what is your Skype ID? Mine is tfh002. Could you please send me a Skype invite? When would be a good time for us to talk? Where are you located?
I kind of need to have my thesis completed within the next 2 weeks. I already have R scripts. I would need help in having the software summarizing the results for me. It is too much for me to compare all relevant GO (Gene Ontology) terms visually. I also need help in comparing my findings with other publications. There are so many ways to analyze data. I am not sure which approach to use. My R script is 600 lines long and produces almost 200 enrichment files with every single run. I am feeling completely overwhelmed with data.
Could you please respond to me directly via email Skype or phone because my text to speech software has difficulties reading out this website aloud to me. I can communicate much better in the much more familiar interface of G-mail and Skype.
I am very much looking forward to the opportunity to talk to you by voice soon.
I thank you very much in advance for your assistance, time, help, efforts and support.
With very warm regards,
Thomas Hahn
Cell phone: 318 243 3940
Skype ID: tfh002
Email: Thomas.F.Hahn2@gmail.com
Hi Dave,
Excellent post.
I don’t see where the “x” object came from when you convert to a list.
Thank you.
Thanks for spotting the typo! It’s supposed to be
entrez_to_go <- as.list(entrez_object[mapped_genes])
Hi Dave,
Could you please explain, where did the “ËntrezGene” come from prior to line 19, the section where you explain on how to convert into EntrezID?
Cheers!