I've updated this post (2013 June 29th) to use the latest version of R, Bioconductor and edgeR. I also demonstrate how results of edgeR can be saved and outputted into one useful table.
The DGEList object holds the dataset to be analysed by edgeR and the subsequent calculations performed on the dataset. Specifically it contains:
counts | numeric matrix containing the read counts. |
lib.size | numeric vector containing the total to normalize against for each sample (optional) |
norm.factors | numeric vector containing normalization factors (optional, defaults to all 1) |
group | vector giving the experimental group/condition for each sample/library |
genes | data frame containing annotation information for the tags/transcripts/genes for which we have count data (optional). |
remove.zeros | whether to remove rows that have 0 total count; default is FALSE so as to retain all information in the dataset |
After calling the function estimateCommonDisp the DGEList object contains several new elemenets. Straight from the manual:
The output of estimateCommonDisp is a DGEList object with several new elements. The element common.dispersion, as the name suggests, provides the estimate of the common dispersion, and pseudo.alt gives the pseudocounts calculated under the alternative hypothesis. The element genes contains the information about gene/tag identifiers. The element conc gives the estimates of the overall concentration of each tag across all of the original samples (conc$conc.common) and the estimate of the concentration of each tag within each group (conc$conc.group). The element common.lib.size gives the library size to which the original libraries have been adjusted in the pseudocounts.
Here I use edgeR for a differential gene expression analysis and show how I access the DGEList and the results of the exact test and save it all into an expanded data frame:
# install DESeq if necessary source("http://bioconductor.org/biocLite.R") biocLite("DESeq") # load DESeq library("DESeq") # use example data from DESeq example_file <- system.file ("extra/TagSeqExample.tab", package="DESeq") data <- read.delim(example_file, header=T, row.names="gene") # check out data dim(data) [1] 18760 6 head(data) T1a T1b T2 T3 N1 N2 Gene_00001 0 0 2 0 0 1 Gene_00002 20 8 12 5 19 26 Gene_00003 3 0 2 0 0 0 Gene_00004 75 84 241 149 271 257 Gene_00005 10 16 4 0 4 10 Gene_00006 129 126 451 223 243 149 summary(rowSums(data)) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 31 228 1521 883 234500 # let's get rid of some lowly expressed genes data_subset <- data[rowSums(data)>10,] dim(data_subset) [1] 15716 6 # classic edgeR for differential expression # install if necessary source("http://bioconductor.org/biocLite.R") biocLite("edgeR") # load edgeR library(edgeR) # create groups group <- c(rep("Tumour",4),rep("Control",2)) # create the DGEList object and calculate variance d <- DGEList(counts = data_subset, group=group) d <- calcNormFactors(d) d <- estimateCommonDisp(d,verbose=TRUE) Disp = 0.57717 , BCV = 0.7597 d <- estimateTagwiseDisp(d) de.tgw <- exactTest(d) summary(decideTestsDGE(de.tgw, p.value=0.01)) [,1] -1 93 0 15587 1 36 topTags(de.tgw) Comparison of groups: Tumour-Control logFC logCPM PValue FDR Gene_12457 -11.954146 5.205006 2.725899e-17 4.284023e-13 Gene_14803 -11.181823 4.410806 5.260892e-16 4.134009e-12 Gene_00665 -8.093491 4.579092 8.091297e-15 4.238761e-11 Gene_03846 -6.845510 5.869045 6.247258e-14 2.454548e-10 Gene_01354 -9.611609 6.619495 1.831463e-12 5.756654e-09 Gene_10880 -10.210789 6.010410 2.499662e-12 6.547448e-09 Gene_09044 -8.378516 4.605649 4.742492e-12 1.020125e-08 Gene_18549 -8.151739 3.708076 5.192798e-12 1.020125e-08 Gene_02555 -10.534767 3.859586 8.331151e-12 1.454804e-08 Gene_08310 -5.250630 6.059341 2.135419e-11 3.356024e-08 # now say if you wanted to know the tagwise dispersions for each gene names(d) [1] "counts" "samples" "common.dispersion" "pseudo.counts" "AveLogCPM" "pseudo.lib.size" [7] "prior.n" "tagwise.dispersion" # add tagwise dispersions as a column to original data frame data_subset$twd <- d$tagwise.dispersion head(data_subset) T1a T1b T2 T3 N1 N2 twd Gene_00002 20 8 12 5 19 26 0.7054258 Gene_00004 75 84 241 149 271 257 0.3119328 Gene_00005 10 16 4 0 4 10 1.3231080 Gene_00006 129 126 451 223 243 149 0.3076355 Gene_00007 13 4 21 19 31 4 0.6682897 Gene_00009 202 122 256 43 287 357 0.4313053 # plot the tagwise dispersions # histogram at the end of the post hist(data_subset$twd, breaks=20, xlim=c(0,3)) # now if you wanted to save the fold changes and p values per gene names(de.tgw) [1] "table" "comparison" "genes" head(de.tgw$table) logFC logCPM PValue Gene_00002 -0.71266757 1.950463 0.4919477 Gene_00004 -0.90990666 5.280466 0.1732668 Gene_00005 0.59732915 1.126382 0.8207064 Gene_00006 0.30343120 5.532274 0.7148083 Gene_00007 -0.03886553 1.863605 0.9267819 Gene_00009 -0.84392775 5.626952 0.2764131 data_subset <- cbind(data_subset, de.tgw$table) head(data_subset) T1a T1b T2 T3 N1 N2 twd logFC logCPM PValue Gene_00002 20 8 12 5 19 26 0.7054258 -0.71266757 1.950463 0.4919477 Gene_00004 75 84 241 149 271 257 0.3119328 -0.90990666 5.280466 0.1732668 Gene_00005 10 16 4 0 4 10 1.3231080 0.59732915 1.126382 0.8207064 Gene_00006 129 126 451 223 243 149 0.3076355 0.30343120 5.532274 0.7148083 Gene_00007 13 4 21 19 31 4 0.6682897 -0.03886553 1.863605 0.9267819 Gene_00009 202 122 256 43 287 357 0.4313053 -0.84392775 5.626952 0.2764131 # I want the fdr data_subset$PValue_fdr <- p.adjust(method="fdr",p=data_subset$PValue) head(data_subset) T1a T1b T2 T3 N1 N2 twd logFC logCPM PValue PValue_fdr Gene_00002 20 8 12 5 19 26 0.7054258 -0.71266757 1.950463 0.4919477 1.0000000 Gene_00004 75 84 241 149 271 257 0.3119328 -0.90990666 5.280466 0.1732668 0.8566228 Gene_00005 10 16 4 0 4 10 1.3231080 0.59732915 1.126382 0.8207064 1.0000000 Gene_00006 129 126 451 223 243 149 0.3076355 0.30343120 5.532274 0.7148083 1.0000000 Gene_00007 13 4 21 19 31 4 0.6682897 -0.03886553 1.863605 0.9267819 1.0000000 Gene_00009 202 122 256 43 287 357 0.4313053 -0.84392775 5.626952 0.2764131 0.9593879 # sanity check with the decideTestsDGE() call table(data_subset$PValue_fdr<0.01) FALSE TRUE 15587 129 # to write out this useful data frame write.table(data_subset, file="blah.tsv", quote=FALSE) sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 [4] LC_NUMERIC=C LC_TIME=English_Australia.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.2.3 limma_3.16.5 DESeq_1.12.0 lattice_0.20-15 locfit_1.5-9.1 Biobase_2.20.0 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] annotate_1.38.0 AnnotationDbi_1.22.5 DBI_0.2-7 genefilter_1.42.0 geneplotter_1.38.0 grid_3.0.1 [7] IRanges_1.18.1 RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.1 stats4_3.0.1 survival_2.37-4 [13] tools_3.0.1 XML_3.96-1.1 xtable_1.7-1

This work is licensed under a Creative Commons
Attribution 4.0 International License.
This may be an old post now, but it’s very helpful! Thanks!
I’m attempting to reproduce it for a first DE analysis, and am confused because I have two groups (“+” and “-“), but topTags(de.tgw) appears to report analysis of 3 groups… (it says “Comparison of groups: +–“). Am I making a mistake in the following, or misinterpreting topTags(de.tgw)’s output?
samplesOfInterest <- c(minusSamples, plusSamples)
group <- c(rep("-", 3), rep("+", 6))
Samples <- DGEList(counts=select(normalizedCounts, one_of(samplesOfInterest)), group = group)
d <- estimateCommonDisp(Samples,verbose=T)
# Disp = 0.03638, BCV = 0.1907
d <- estimateTagwiseDisp(d)
de.tgw <- exactTest(d)
summary(decideTestsDGE(de.tgw, p.value=0.01))
# [,1]
#-1 35
#0 39117
#1 27
topTags(de.tgw)
#Comparison of groups: +– (BH: WHY ARE THERE 3??)
# logFC logCPM PValue FDR
#ENSMUSG00000047676 -7.071383 4.410832 1.283144e-48 5.027230e-44
#ENSMUSG00000027014 -4.346475 7.093599 6.870511e-45 1.345899e-40
#ENSMUSG00000081824 -5.361503 3.252118 2.072516e-25 2.706637e-21
Hi Ben,
there aren’t three. It’s the separator; for example, look in my post and see “Tumour-Control”.
Cheers,
Dave
woot! (and a bit of embarrassment). Thanks again!
Nice post. Do you know how to extract a DGEList into a .csv file? I filtered the data for count values (p.11 of the manual) and want to export the DGEList into a .csv file so I can do additional analysis.
Thanks. You can use the function write.csv() instead of write.table(), as I have in my post.
Hi Dave,
Is it possible to transpose gene specific tag wise dispersion into column as I am having hard time transposing it!!
> head(e$tagwise.dispersion)
[1] 0.003929900 0.005266797 0.250150536 0.004776984 0.003584760 0.003878662
Hi Anksi,
you will need to create a data.frame if you want the tagwise dispersions as a column. If you follow the example in the post, here’s how you can create a column of tagwise dispersions in a data frame:
df <- data.frame(gene = row.names(d), tgw = d$tagwise.dispersion) head(df) gene tgw 1 Gene_00002 0.7000863 2 Gene_00004 0.3144113 3 Gene_00005 1.2737697 4 Gene_00006 0.3085067 5 Gene_00007 0.6749006 6 Gene_00009 0.4286794
Hope that helps,
Dave
How do I export normalized counts using this?