edgeR vs. SAMSeq

A while ago I received a comment on comparing parametric methods against nonparametric for calling differential expression in count data. Here I compare SAMSeq (Jun Li and Robert Tibshirani (2011) Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Statistical Methods in Medical Research, in press.) with edgeR. For more information on SAMSeq refer to the paper. I intend to look at cuffdiff sometime later.

library(edgeR)
raw.data <- read.delim("pnas_expression_filtered.tsv",row.names=1)
d <- raw.data[,1:7]
rownames(d) <- rownames(raw.data)
head(d)
#                lane1 lane2 lane3 lane4 lane5 lane6 lane8
#ENSG00000124208   478   619   628   744   483   716   240
#ENSG00000182463    27    20    27    26    48    55    24
#ENSG00000124201   180   218   293   275   373   301    88
#ENSG00000124207    76    80    85    97    80    81    37
#ENSG00000125835   132   200   200   228   280   204    52
#ENSG00000125834    42    60    72    86   131    99    30
group <- c(rep("Control",4),rep("DHT",3))
d <- DGEList(counts = d, group=group)
#Calculating library sizes from column totals.
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
de.com <- exactTest(d)
#Comparison of groups:  DHT - Control
summary(decideTestsDGE(de.com,p.value=0.05))
#   [,1]
#-1  2368
#0  11715
#1   2411
d <- estimateTagwiseDisp(d, prop.used=0.5, grid.length=500)
de.tgw <- exactTest(d)
summary(decideTestsDGE(de.tgw, p.value=0.05))
#   [,1]
#-1  2119
#0  12048
#1   2327
#total 4446
de <- summary(decideTestsDGE(de.tgw,p.value=0.05))
total <- de[1] + de[3]
sink("edger.output")
topTags(de.tgw,n=total)
sink()
sessionInfo()
#R version 2.14.1 (2011-12-22)
#Platform: x86_64-unknown-linux-gnu (64-bit)
#
#locale:
#[1] C
#
#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base
#
#other attached packages:
#[1] edgeR_2.4.1
#
#loaded via a namespace (and not attached):
#[1] limma_3.10.0

For SAMSeq:

Data type? Two class unpaired
Array or Seq data? seq
Arrays centered? FALSE
Delta 2.21025
Minimum fold change 0
Test statistic standard
Are data are log scale? FALSE
Number of permutations 1000
Number of neighbors for KNN 10
Seed for Random number generator 1234567
Estimate of pi0 (proportion of null genes) 0.55
Exchangibility factor s0 0
s0 percentile 0
Quantiles of estimated sequencing depths (0%, 25%, 50%, 75%, 100%) 0.474, 0.822, 1.116, 1.228, 1.311
False Discovery Rate (%) 3.228881

Of the 4,408 significantly differentiated genes detected by SAMSeq using the settings above, 3,908 overlap the 4,446 detected by edgeR using a tagwise dispersion.

See also: edgeR vs. DESeq using this same dataset.

and

http://www.stanford.edu/~junli07/research.html#SAM




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
3 comments Add yours
  1. Dear Dave,
    I have asked some questions regarding samseq through email and you kindly help.
    Just an issue about the samseq results:
    This is the command I ran:
    SAMseq(x = counts, y = group, resp.type = “Two class unpaired”,
    nperms = 1000, random.seed = 1234567, fdr.output = 0.05)
    as you see I have chose the FDR = 0.05 but I do not know why in the output I have something like this:
    > samseqThre1[[1]][[4]]
    $genes.up
    Gene ID Gene Name Score(d) Fold Change q-value(%)
    [1,] “g94” “94” “8” “1.626” “16.76”
    [2,] “g1242” “1242” “8” “27.261” “16.76”
    [3,] “g1295” “1295” “8” “1.511” “16.76”
    [4,] “g1365” “1365” “8” “1.465” “16.76”
    [5,] “g1487” “1487” “8” “2.815” “16.76”
    ….
    ….
    as you can see the q value which is suppose to be the lowest FDR in samr is 16.76 % so it means that the lowest FDR regarding to these genes should be 0.1676 which is more than what I chose in my command(0.05). Do you have any hints or suggestions or maybe I am misunderstood.
    Thank you in advance

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.