I've updated this post (2013 June 29th) to use the latest version of R, Bioconductor and edgeR. I also demonstrate how results of edgeR can be saved and outputted into one useful table.
The DGEList object holds the dataset to be analysed by edgeR and the subsequent calculations performed on the dataset. Specifically it contains:
| counts | numeric matrix containing the read counts. |
| lib.size | numeric vector containing the total to normalize against for each sample (optional) |
| norm.factors | numeric vector containing normalization factors (optional, defaults to all 1) |
| group | vector giving the experimental group/condition for each sample/library |
| genes | data frame containing annotation information for the tags/transcripts/genes for which we have count data (optional). |
| remove.zeros | whether to remove rows that have 0 total count; default is FALSE so as to retain all information in the dataset |
After calling the function estimateCommonDisp the DGEList object contains several new elemenets. Straight from the manual:
The output of estimateCommonDisp is a DGEList object with several new elements. The element common.dispersion, as the name suggests, provides the estimate of the common dispersion, and pseudo.alt gives the pseudocounts calculated under the alternative hypothesis. The element genes contains the information about gene/tag identifiers. The element conc gives the estimates of the overall concentration of each tag across all of the original samples (conc$conc.common) and the estimate of the concentration of each tag within each group (conc$conc.group). The element common.lib.size gives the library size to which the original libraries have been adjusted in the pseudocounts.
Here I use edgeR for a differential gene expression analysis and show how I access the DGEList and the results of the exact test and save it all into an expanded data frame:
# install DESeq if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq")
# load DESeq
library("DESeq")
# use example data from DESeq
example_file <- system.file ("extra/TagSeqExample.tab", package="DESeq")
data <- read.delim(example_file, header=T, row.names="gene")
# check out data
dim(data)
[1] 18760 6
head(data)
T1a T1b T2 T3 N1 N2
Gene_00001 0 0 2 0 0 1
Gene_00002 20 8 12 5 19 26
Gene_00003 3 0 2 0 0 0
Gene_00004 75 84 241 149 271 257
Gene_00005 10 16 4 0 4 10
Gene_00006 129 126 451 223 243 149
summary(rowSums(data))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 31 228 1521 883 234500
# let's get rid of some lowly expressed genes
data_subset <- data[rowSums(data)>10,]
dim(data_subset)
[1] 15716 6
# classic edgeR for differential expression
# install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
# load edgeR
library(edgeR)
# create groups
group <- c(rep("Tumour",4),rep("Control",2))
# create the DGEList object and calculate variance
d <- DGEList(counts = data_subset, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d,verbose=TRUE)
Disp = 0.57717 , BCV = 0.7597
d <- estimateTagwiseDisp(d)
de.tgw <- exactTest(d)
summary(decideTestsDGE(de.tgw, p.value=0.01))
[,1]
-1 93
0 15587
1 36
topTags(de.tgw)
Comparison of groups: Tumour-Control
logFC logCPM PValue FDR
Gene_12457 -11.954146 5.205006 2.725899e-17 4.284023e-13
Gene_14803 -11.181823 4.410806 5.260892e-16 4.134009e-12
Gene_00665 -8.093491 4.579092 8.091297e-15 4.238761e-11
Gene_03846 -6.845510 5.869045 6.247258e-14 2.454548e-10
Gene_01354 -9.611609 6.619495 1.831463e-12 5.756654e-09
Gene_10880 -10.210789 6.010410 2.499662e-12 6.547448e-09
Gene_09044 -8.378516 4.605649 4.742492e-12 1.020125e-08
Gene_18549 -8.151739 3.708076 5.192798e-12 1.020125e-08
Gene_02555 -10.534767 3.859586 8.331151e-12 1.454804e-08
Gene_08310 -5.250630 6.059341 2.135419e-11 3.356024e-08
# now say if you wanted to know the tagwise dispersions for each gene
names(d)
[1] "counts" "samples" "common.dispersion" "pseudo.counts" "AveLogCPM" "pseudo.lib.size"
[7] "prior.n" "tagwise.dispersion"
# add tagwise dispersions as a column to original data frame
data_subset$twd <- d$tagwise.dispersion
head(data_subset)
T1a T1b T2 T3 N1 N2 twd
Gene_00002 20 8 12 5 19 26 0.7054258
Gene_00004 75 84 241 149 271 257 0.3119328
Gene_00005 10 16 4 0 4 10 1.3231080
Gene_00006 129 126 451 223 243 149 0.3076355
Gene_00007 13 4 21 19 31 4 0.6682897
Gene_00009 202 122 256 43 287 357 0.4313053
# plot the tagwise dispersions
# histogram at the end of the post
hist(data_subset$twd, breaks=20, xlim=c(0,3))
# now if you wanted to save the fold changes and p values per gene
names(de.tgw)
[1] "table" "comparison" "genes"
head(de.tgw$table)
logFC logCPM PValue
Gene_00002 -0.71266757 1.950463 0.4919477
Gene_00004 -0.90990666 5.280466 0.1732668
Gene_00005 0.59732915 1.126382 0.8207064
Gene_00006 0.30343120 5.532274 0.7148083
Gene_00007 -0.03886553 1.863605 0.9267819
Gene_00009 -0.84392775 5.626952 0.2764131
data_subset <- cbind(data_subset, de.tgw$table)
head(data_subset)
T1a T1b T2 T3 N1 N2 twd logFC logCPM PValue
Gene_00002 20 8 12 5 19 26 0.7054258 -0.71266757 1.950463 0.4919477
Gene_00004 75 84 241 149 271 257 0.3119328 -0.90990666 5.280466 0.1732668
Gene_00005 10 16 4 0 4 10 1.3231080 0.59732915 1.126382 0.8207064
Gene_00006 129 126 451 223 243 149 0.3076355 0.30343120 5.532274 0.7148083
Gene_00007 13 4 21 19 31 4 0.6682897 -0.03886553 1.863605 0.9267819
Gene_00009 202 122 256 43 287 357 0.4313053 -0.84392775 5.626952 0.2764131
# I want the fdr
data_subset$PValue_fdr <- p.adjust(method="fdr",p=data_subset$PValue)
head(data_subset)
T1a T1b T2 T3 N1 N2 twd logFC logCPM PValue PValue_fdr
Gene_00002 20 8 12 5 19 26 0.7054258 -0.71266757 1.950463 0.4919477 1.0000000
Gene_00004 75 84 241 149 271 257 0.3119328 -0.90990666 5.280466 0.1732668 0.8566228
Gene_00005 10 16 4 0 4 10 1.3231080 0.59732915 1.126382 0.8207064 1.0000000
Gene_00006 129 126 451 223 243 149 0.3076355 0.30343120 5.532274 0.7148083 1.0000000
Gene_00007 13 4 21 19 31 4 0.6682897 -0.03886553 1.863605 0.9267819 1.0000000
Gene_00009 202 122 256 43 287 357 0.4313053 -0.84392775 5.626952 0.2764131 0.9593879
# sanity check with the decideTestsDGE() call
table(data_subset$PValue_fdr<0.01)
FALSE TRUE
15587 129
# to write out this useful data frame
write.table(data_subset, file="blah.tsv", quote=FALSE)
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.2.3 limma_3.16.5 DESeq_1.12.0 lattice_0.20-15 locfit_1.5-9.1 Biobase_2.20.0 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] annotate_1.38.0 AnnotationDbi_1.22.5 DBI_0.2-7 genefilter_1.42.0 geneplotter_1.38.0 grid_3.0.1
[7] IRanges_1.18.1 RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.1 stats4_3.0.1 survival_2.37-4
[13] tools_3.0.1 XML_3.96-1.1 xtable_1.7-1

This work is licensed under a Creative Commons
Attribution 4.0 International License.

This may be an old post now, but it’s very helpful! Thanks!
I’m attempting to reproduce it for a first DE analysis, and am confused because I have two groups (“+” and “-“), but topTags(de.tgw) appears to report analysis of 3 groups… (it says “Comparison of groups: +–“). Am I making a mistake in the following, or misinterpreting topTags(de.tgw)’s output?
samplesOfInterest <- c(minusSamples, plusSamples)
group <- c(rep("-", 3), rep("+", 6))
Samples <- DGEList(counts=select(normalizedCounts, one_of(samplesOfInterest)), group = group)
d <- estimateCommonDisp(Samples,verbose=T)
# Disp = 0.03638, BCV = 0.1907
d <- estimateTagwiseDisp(d)
de.tgw <- exactTest(d)
summary(decideTestsDGE(de.tgw, p.value=0.01))
# [,1]
#-1 35
#0 39117
#1 27
topTags(de.tgw)
#Comparison of groups: +– (BH: WHY ARE THERE 3??)
# logFC logCPM PValue FDR
#ENSMUSG00000047676 -7.071383 4.410832 1.283144e-48 5.027230e-44
#ENSMUSG00000027014 -4.346475 7.093599 6.870511e-45 1.345899e-40
#ENSMUSG00000081824 -5.361503 3.252118 2.072516e-25 2.706637e-21
Hi Ben,
there aren’t three. It’s the separator; for example, look in my post and see “Tumour-Control”.
Cheers,
Dave
woot! (and a bit of embarrassment). Thanks again!
Nice post. Do you know how to extract a DGEList into a .csv file? I filtered the data for count values (p.11 of the manual) and want to export the DGEList into a .csv file so I can do additional analysis.
Thanks. You can use the function write.csv() instead of write.table(), as I have in my post.
Hi Dave,
Is it possible to transpose gene specific tag wise dispersion into column as I am having hard time transposing it!!
> head(e$tagwise.dispersion)
[1] 0.003929900 0.005266797 0.250150536 0.004776984 0.003584760 0.003878662
Hi Anksi,
you will need to create a data.frame if you want the tagwise dispersions as a column. If you follow the example in the post, here’s how you can create a column of tagwise dispersions in a data frame:
df <- data.frame(gene = row.names(d), tgw = d$tagwise.dispersion) head(df) gene tgw 1 Gene_00002 0.7000863 2 Gene_00004 0.3144113 3 Gene_00005 1.2737697 4 Gene_00006 0.3085067 5 Gene_00007 0.6749006 6 Gene_00009 0.4286794
Hope that helps,
Dave
How do I export normalized counts using this?