The DGEList object in R

I've updated this post (2013 June 29th) to use the latest version of R, Bioconductor and edgeR. I also demonstrate how results of edgeR can be saved and outputted into one useful table.

The DGEList object holds the dataset to be analysed by edgeR and the subsequent calculations performed on the dataset. Specifically it contains:

counts numeric matrix containing the read counts.
lib.size numeric vector containing the total to normalize against for each sample (optional)
norm.factors numeric vector containing normalization factors (optional, defaults to all 1)
group vector giving the experimental group/condition for each sample/library
genes data frame containing annotation information for the tags/transcripts/genes for which we have count data (optional).
remove.zeros whether to remove rows that have 0 total count; default is FALSE so as to retain all information in the dataset

After calling the function estimateCommonDisp the DGEList object contains several new elemenets. Straight from the manual:

The output of estimateCommonDisp is a DGEList object with several new elements. The element common.dispersion, as the name suggests, provides the estimate of the common dispersion, and pseudo.alt gives the pseudocounts calculated under the alternative hypothesis. The element genes contains the information about gene/tag identifiers. The element conc gives the estimates of the overall concentration of each tag across all of the original samples (conc$conc.common) and the estimate of the concentration of each tag within each group (conc$conc.group). The element common.lib.size gives the library size to which the original libraries have been adjusted in the pseudocounts.

Here I use edgeR for a differential gene expression analysis and show how I access the DGEList and the results of the exact test and save it all into an expanded data frame:

# install DESeq if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq")

# load DESeq
library("DESeq")

# use example data from DESeq
example_file <- system.file ("extra/TagSeqExample.tab", package="DESeq")
data <- read.delim(example_file, header=T, row.names="gene")

# check out data
dim(data)
[1] 18760     6

head(data)
           T1a T1b  T2  T3  N1  N2
Gene_00001   0   0   2   0   0   1
Gene_00002  20   8  12   5  19  26
Gene_00003   3   0   2   0   0   0
Gene_00004  75  84 241 149 271 257
Gene_00005  10  16   4   0   4  10
Gene_00006 129 126 451 223 243 149

summary(rowSums(data))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      31     228    1521     883  234500

# let's get rid of some lowly expressed genes
data_subset <- data[rowSums(data)>10,]
dim(data_subset)
[1] 15716     6

# classic edgeR for differential expression
# install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")

# load edgeR
library(edgeR)

# create groups
group <- c(rep("Tumour",4),rep("Control",2))

# create the DGEList object and calculate variance
d <- DGEList(counts = data_subset, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d,verbose=TRUE)
Disp = 0.57717 , BCV = 0.7597

d <- estimateTagwiseDisp(d)
de.tgw <- exactTest(d)

summary(decideTestsDGE(de.tgw, p.value=0.01))
   [,1] 
-1    93
0  15587
1     36

topTags(de.tgw)
Comparison of groups:  Tumour-Control 
                logFC   logCPM       PValue          FDR
Gene_12457 -11.954146 5.205006 2.725899e-17 4.284023e-13
Gene_14803 -11.181823 4.410806 5.260892e-16 4.134009e-12
Gene_00665  -8.093491 4.579092 8.091297e-15 4.238761e-11
Gene_03846  -6.845510 5.869045 6.247258e-14 2.454548e-10
Gene_01354  -9.611609 6.619495 1.831463e-12 5.756654e-09
Gene_10880 -10.210789 6.010410 2.499662e-12 6.547448e-09
Gene_09044  -8.378516 4.605649 4.742492e-12 1.020125e-08
Gene_18549  -8.151739 3.708076 5.192798e-12 1.020125e-08
Gene_02555 -10.534767 3.859586 8.331151e-12 1.454804e-08
Gene_08310  -5.250630 6.059341 2.135419e-11 3.356024e-08

# now say if you wanted to know the tagwise dispersions for each gene
names(d)
[1] "counts"             "samples"            "common.dispersion"  "pseudo.counts"      "AveLogCPM"          "pseudo.lib.size"   
[7] "prior.n"            "tagwise.dispersion"

# add tagwise dispersions as a column to original data frame
data_subset$twd <- d$tagwise.dispersion

head(data_subset)
           T1a T1b  T2  T3  N1  N2       twd
Gene_00002  20   8  12   5  19  26 0.7054258
Gene_00004  75  84 241 149 271 257 0.3119328
Gene_00005  10  16   4   0   4  10 1.3231080
Gene_00006 129 126 451 223 243 149 0.3076355
Gene_00007  13   4  21  19  31   4 0.6682897
Gene_00009 202 122 256  43 287 357 0.4313053

# plot the tagwise dispersions
# histogram at the end of the post
hist(data_subset$twd, breaks=20, xlim=c(0,3))

# now if you wanted to save the fold changes and p values per gene
names(de.tgw)
[1] "table"      "comparison" "genes"

head(de.tgw$table)
                 logFC   logCPM    PValue
Gene_00002 -0.71266757 1.950463 0.4919477
Gene_00004 -0.90990666 5.280466 0.1732668
Gene_00005  0.59732915 1.126382 0.8207064
Gene_00006  0.30343120 5.532274 0.7148083
Gene_00007 -0.03886553 1.863605 0.9267819
Gene_00009 -0.84392775 5.626952 0.2764131

data_subset <- cbind(data_subset, de.tgw$table)
head(data_subset)
           T1a T1b  T2  T3  N1  N2       twd       logFC   logCPM    PValue
Gene_00002  20   8  12   5  19  26 0.7054258 -0.71266757 1.950463 0.4919477
Gene_00004  75  84 241 149 271 257 0.3119328 -0.90990666 5.280466 0.1732668
Gene_00005  10  16   4   0   4  10 1.3231080  0.59732915 1.126382 0.8207064
Gene_00006 129 126 451 223 243 149 0.3076355  0.30343120 5.532274 0.7148083
Gene_00007  13   4  21  19  31   4 0.6682897 -0.03886553 1.863605 0.9267819
Gene_00009 202 122 256  43 287 357 0.4313053 -0.84392775 5.626952 0.2764131

# I want the fdr
data_subset$PValue_fdr <- p.adjust(method="fdr",p=data_subset$PValue)
head(data_subset)
           T1a T1b  T2  T3  N1  N2       twd       logFC   logCPM    PValue PValue_fdr
Gene_00002  20   8  12   5  19  26 0.7054258 -0.71266757 1.950463 0.4919477  1.0000000
Gene_00004  75  84 241 149 271 257 0.3119328 -0.90990666 5.280466 0.1732668  0.8566228
Gene_00005  10  16   4   0   4  10 1.3231080  0.59732915 1.126382 0.8207064  1.0000000
Gene_00006 129 126 451 223 243 149 0.3076355  0.30343120 5.532274 0.7148083  1.0000000
Gene_00007  13   4  21  19  31   4 0.6682897 -0.03886553 1.863605 0.9267819  1.0000000
Gene_00009 202 122 256  43 287 357 0.4313053 -0.84392775 5.626952 0.2764131  0.9593879

# sanity check with the decideTestsDGE() call
table(data_subset$PValue_fdr<0.01)

FALSE  TRUE 
15587   129

# to write out this useful data frame
write.table(data_subset, file="blah.tsv", quote=FALSE)

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.2.3        limma_3.16.5       DESeq_1.12.0       lattice_0.20-15    locfit_1.5-9.1     Biobase_2.20.0     BiocGenerics_0.6.0

loaded via a namespace (and not attached):
 [1] annotate_1.38.0      AnnotationDbi_1.22.5 DBI_0.2-7            genefilter_1.42.0    geneplotter_1.38.0   grid_3.0.1          
 [7] IRanges_1.18.1       RColorBrewer_1.0-5   RSQLite_0.11.4       splines_3.0.1        stats4_3.0.1         survival_2.37-4     
[13] tools_3.0.1          XML_3.96-1.1         xtable_1.7-1

typical_tagwise_dispersions




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
Posted in RTagged ,
8 comments Add yours
  1. This may be an old post now, but it’s very helpful! Thanks!

    I’m attempting to reproduce it for a first DE analysis, and am confused because I have two groups (“+” and “-“), but topTags(de.tgw) appears to report analysis of 3 groups… (it says “Comparison of groups: +–“). Am I making a mistake in the following, or misinterpreting topTags(de.tgw)’s output?

    samplesOfInterest <- c(minusSamples, plusSamples)
    group <- c(rep("-", 3), rep("+", 6))
    Samples <- DGEList(counts=select(normalizedCounts, one_of(samplesOfInterest)), group = group)

    d <- estimateCommonDisp(Samples,verbose=T)
    # Disp = 0.03638, BCV = 0.1907

    d <- estimateTagwiseDisp(d)
    de.tgw <- exactTest(d)
    summary(decideTestsDGE(de.tgw, p.value=0.01))
    # [,1]
    #-1 35
    #0 39117
    #1 27

    topTags(de.tgw)
    #Comparison of groups: +– (BH: WHY ARE THERE 3??)
    # logFC logCPM PValue FDR
    #ENSMUSG00000047676 -7.071383 4.410832 1.283144e-48 5.027230e-44
    #ENSMUSG00000027014 -4.346475 7.093599 6.870511e-45 1.345899e-40
    #ENSMUSG00000081824 -5.361503 3.252118 2.072516e-25 2.706637e-21

    1. Hi Ben,

      there aren’t three. It’s the separator; for example, look in my post and see “Tumour-Control”.

      Cheers,

      Dave

  2. Nice post. Do you know how to extract a DGEList into a .csv file? I filtered the data for count values (p.11 of the manual) and want to export the DGEList into a .csv file so I can do additional analysis.

  3. Hi Dave,
    Is it possible to transpose gene specific tag wise dispersion into column as I am having hard time transposing it!!
    > head(e$tagwise.dispersion)
    [1] 0.003929900 0.005266797 0.250150536 0.004776984 0.003584760 0.003878662

    1. Hi Anksi,

      you will need to create a data.frame if you want the tagwise dispersions as a column. If you follow the example in the post, here’s how you can create a column of tagwise dispersions in a data frame:


      df <- data.frame(gene = row.names(d), tgw = d$tagwise.dispersion) head(df) gene tgw 1 Gene_00002 0.7000863 2 Gene_00004 0.3144113 3 Gene_00005 1.2737697 4 Gene_00006 0.3085067 5 Gene_00007 0.6749006 6 Gene_00009 0.4286794

      Hope that helps,

      Dave

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.