goseq is a Bioconductor package for performing a gene ontology analysis for RNA-seq and other length biased data. For more information on how to use the package, please refer to the vignette. For more information on the pnas_expression file please refer to my older post.
#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("goseq")
library(goseq)
data <- read.table("pnas_expression_filtered.tsv",header=T,row.names=1)
nrow(data)
[1] 16494
head(data)
lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000124208 478 619 628 744 483 716 240
ENSG00000182463 27 20 27 26 48 55 24
ENSG00000124201 180 218 293 275 373 301 88
ENSG00000124207 76 80 85 97 80 81 37
ENSG00000125835 132 200 200 228 280 204 52
ENSG00000125834 42 60 72 86 131 99 30
d <- data[,1:7]
rownames(d) <- row.names(data)
#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
library(edgeR)
group <- c(rep("Control",4),rep("Test",3))
d <- DGEList(counts = d, group=group)
dim(d)
[1] 16494 7
d <- calcNormFactors(d)
d
An object of class "DGEList"
$counts
lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000124208 478 619 628 744 483 716 240
ENSG00000182463 27 20 27 26 48 55 24
ENSG00000124201 180 218 293 275 373 301 88
ENSG00000124207 76 80 85 97 80 81 37
ENSG00000125835 132 200 200 228 280 204 52
16489 more rows ...
$samples
group lib.size norm.factors
lane1 Control 976847 1.0296636
lane2 Control 1154746 1.0372521
lane3 Control 1439393 1.0362662
lane4 Control 1482652 1.0378383
lane5 Test 1820628 0.9537095
lane6 Test 1831553 0.9525624
lane8 Test 680798 0.9583181
d <- estimateCommonDisp(d)
d$common.dispersion
[1] 0.02001965
sqrt(d$common.dispersion)
[1] 0.1414908
de.com <- exactTest(d)
d <- estimateTagwiseDisp(d)
de.tgw <- exactTest(d)
summary(decideTestsDGE(de.tgw, p.value=0.01))
[,1]
-1 1404
0 13372
1 1718
genes=as.integer(p.adjust(de.tgw$table$PValue[de.tgw$table$logFC!=0],method="BH")<.01)
names(genes)=row.names(de.tgw$table[de.tgw$table$logFC!=0,])
table(genes)
genes
0 1
13372 3122
head(genes)
ENSG00000124208 ENSG00000182463 ENSG00000124201 ENSG00000124207 ENSG00000125835
0 0 0 0 0
ENSG00000125834
0
pwf=nullp(genes,"hg19","ensGene")
The x-axis of the plot above are binned lengths of Ensembl genes and the y-axis is the ratio of differentially detected genes. We can see a clear bias in the detection of differential expression with longer genes.
GO.wall=goseq(pwf,"hg19","ensGene")
head(GO.wall)
category over_represented_pvalue under_represented_pvalue
2269 GO:0005623 1.741568e-52 1
9863 GO:0044464 1.741568e-52 1
9830 GO:0044424 2.533098e-49 1
2268 GO:0005622 2.088365e-48 1
2238 GO:0005575 6.005558e-48 1
2321 GO:0005737 2.845073e-47 1
table(GO.wall$over_represented_pvalue<0.01)
FALSE TRUE
14092 844
GO.samp=goseq(pwf,"hg19","ensGene",method="Sampling",repcnt=1000)
head(GO.samp)
category over_represented_pvalue under_represented_pvalue
36 GO:0000075 0.000999001 1
41 GO:0000082 0.000999001 1
46 GO:0000087 0.000999001 1
67 GO:0000122 0.000999001 1
77 GO:0000139 0.000999001 1
92 GO:0000166 0.000999001 1
table(GO.samp$over_represented_pvalue<0.05)
FALSE TRUE
13413 1523
head(GO.nobias)
category over_represented_pvalue under_represented_pvalue
2269 GO:0005623 1.038052e-67 1
9863 GO:0044464 1.038052e-67 1
2238 GO:0005575 3.014341e-64 1
9830 GO:0044424 3.255426e-60 1
2268 GO:0005622 6.342680e-60 1
3482 GO:0008150 1.916257e-59 1
table(GO.nobias$over_represented_pvalue<0.01)
FALSE TRUE
13961 975
summary(GO.wall$category[p.adjust(GO.wall$over_represented_pvalue,method="BH")<.01])
Length Class Mode
404 character character
library(GO.db)
for(go in wall.enriched.GO[1:5]){
print(GOTERM[[go]])
cat("--------------------------------------\n")
}
GOID: GO:0005623
Term: cell
Ontology: CC
Definition: The basic structural and functional unit of all organisms.
Includes the plasma membrane and any external encapsulating
structures such as the cell wall and cell envelope.
--------------------------------------
GOID: GO:0044464
Term: cell part
Ontology: CC
Definition: Any constituent part of a cell, the basic structural and
functional unit of all organisms.
Synonym: protoplast
--------------------------------------
GOID: GO:0044424
Term: intracellular part
Ontology: CC
Definition: Any constituent part of the living contents of a cell; the
matter contained within (but not including) the plasma membrane,
usually taken to exclude large vacuoles and masses of secretory or
ingested material. In eukaryotes it includes the nucleus and
cytoplasm.
--------------------------------------
GOID: GO:0005622
Term: intracellular
Ontology: CC
Definition: The living contents of a cell; the matter contained within
(but not including) the plasma membrane, usually taken to exclude
large vacuoles and masses of secretory or ingested material. In
eukaryotes it includes the nucleus and cytoplasm.
Synonym: internal to cell
Synonym: nucleocytoplasm
Synonym: protoplasm
Synonym: protoplast
--------------------------------------
GOID: GO:0005575
Term: cellular_component
Ontology: CC
Definition: The part of a cell or its extracellular environment in
which a gene product is located. A gene product may be located in
one or more parts of a cell and its location may be as specific as
a particular macromolecular complex, that is, a stable, persistent
association of macromolecules that function together.
Synonym: cellular component
Synonym: cellular component unknown
Synonym: GO:0008372
Secondary: GO:0008372
--------------------------------------
sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GO.db_2.8.0 org.Hs.eg.db_2.8.0 RSQLite_0.11.2
[4] DBI_0.2-5 AnnotationDbi_1.20.2 Biobase_2.18.0
[7] BiocGenerics_0.4.0 edgeR_3.0.2 limma_3.14.1
[10] goseq_1.10.0 geneLenDataBase_0.99.10 BiasedUrn_1.04
[13] BiocInstaller_1.8.3
loaded via a namespace (and not attached):
[1] biomaRt_2.14.0 Biostrings_2.26.2 bitops_1.0-4.2
[4] BSgenome_1.26.1 GenomicFeatures_1.10.0 GenomicRanges_1.10.5
[7] grid_2.15.2 IRanges_1.16.4 lattice_0.20-10
[10] Matrix_1.0-10 mgcv_1.7-22 nlme_3.1-105
[13] parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2
[16] rtracklayer_1.18.0 stats4_2.15.2 tools_2.15.2
[19] XML_3.95-0.1 zlibbioc_1.4.0

This work is licensed under a Creative Commons
Attribution 4.0 International License.
