6th April 2012: For a more updated version of this post, please refer see this post.

### A very simple comparison

Using the TagSeqExample.tab file from the DESeq package as the benchmark dataset. According to DESeq authors, T1a and T1b are similar, so I removed the second column in the file corresponding to T1a:

cut --complement -f2 TagSeqExample.tab > tag_seq_example_less_t1a.tsv

**Hierarchical clustering**

To get an idea of how similar the libraries are, I performed hierarchical clustering using the Spearman correlation of libraries (see here). Note that these raw reads have not been adjusted for sequencing depth, but hopefully by using a ranked correlation method we can bypass this necessity.

From the dendrogram, we observe the marked difference of the normal samples to the tumour samples.

**Installing the R packages**

I then installed all the required packages (R version 2.12.0, biocinstall version 2.7.4.):

source("http://www.bioconductor.org/biocLite.R") biocLite("baySeq") biocLite("DESeq") biocLite("edgeR")

**DESeq**

DESeq code adapted from the vignette:

countsTable <- read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, stringsAsFactors=TRUE) rownames(countsTable) <- countsTable$gene countsTable <- countsTable[,-1] head(countsTable) conds = c("T","T","T","N","N") cds<-newCountDataSet(countsTable,conds) cds<-estimateSizeFactors(cds) sizeFactors(cds)

T1b | T2 | T3 | N1 | N2 |

0.5587394 | 1.5823096 | 1.1270425 | 1.2869337 | 0.8746998 |

cds <- estimateVarianceFunctions (cds) res <- nbinomTest (cds, "T", "N") resSig <- res[ res$padj < .1, ] nrow(resSig) [1] 1072 write.csv(resSig,file="deseq_res_sig.csv")

**edgeR**

First transform the tag_seq_example_less_t1a.tsv file to the edgeR format using this script.

tsv_to_edger.pl tag_seq_example_less_t1a.tsv

library(edgeR) targets = read.delim(file = "targets.txt", stringsAsFactors = FALSE) d = readDGE(targets,comment.char="#") d$samples

files | group | description | lib.size | norm.factors | |

1 | T1b | patient | patient | 2399311 | 1 |

2 | T2 | patient | patient | 7202770 | 1 |

3 | T3 | patient | patient | 5856461 | 1 |

4 | N1 | control | control | 6376307 | 1 |

5 | N2 | control | control | 3931277 | 1 |

Recently edgeR added a normalisation function into the package called calcNormFactors(). For more information see this paper.

d = calcNormFactors(d) d$samples

files | group | description | lib.size | norm.factors | |

1 | T1b | patient | patient | 2399311 | 1.1167157 |

2 | T2 | patient | patient | 7202770 | 0.9978323 |

3 | T3 | patient | patient | 5856461 | 0.9170402 |

4 | N1 | control | control | 6376307 | 0.9546010 |

5 | N2 | control | control | 3931277 | 1.0251550 |

This approach is different from the one taken by DESeq.

d = estimateCommonDisp(d) de.com = exactTest(d) options(digits=4) topTags(de.com) detags.com = rownames(topTags(de.com)$table) sum(de.com$table$p.value < 0.01) [1] 895 sum(p.adjust(de.com$table$p.value,method="BH") < 0.1) [1] 593 good = sum(p.adjust(de.com$table$p.value,method="BH") < 0.1) goodList = topTags(de.com, n=good) sink("edger.txt") goodList sink()

**baySeq**

library(baySeq) all = read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, sep="\t") lib <- c(2399311,7202770,5856461,6376307,3931277) replicates <- c(1,1,1,2,2) groups <- list(NDE = c(1,1,1,1,1), DE = c(1,1,1,2,2) ) cname <- all[,1] all <- all[,-1] all = as.matrix(all) CD <- new("countData", data = all, replicates = replicates, libsizes = as.integer(lib), groups = groups) CD@annotation <- as.data.frame(cname) cl <- NULL CDP.NBML <- getPriors.NB(CD, samplesize = 1000, estimation = "QL", cl = cl) CDPost.NBML <- getLikelihoods.NB(CDP.NBML, pET = 'BIC', cl = cl) CDPost.NBML@estProps [1] 0.8897621 0.1102379 topCounts(CDPost.NBML, group=2) NBML.TPs <- getTPs(CDPost.NBML, group=2, TPs = 1:100) bayseq_de = topCounts(CDPost.NBML, group=2, number=2000) write.csv(bayseq_de, file="bayseq_de.csv")

**Overlap**

R package | Number of differentially expressed genes |

DESeq | 888 |

edgeR | 895 |

baySeq | 1115 |

Total | 18,760 |

DESeq with edgeR | 591 |

DESeq with baySeq | 488 |

edgeR with baySeq | 465 |

DESeq with edgeR with baySeq | 338 |

This is a first draft post and will be continually updated in the future, but I just thought I’ll publish it first.

This work is licensed under a Creative Commons

Attribution 4.0 International License.

Have you tried comparing the output of these parametric methods with the output of non-parametric methods such as CuffDiff?

Thanks for the question Edoardo. I’ve never used CuffDiff before but I would be interested in comparing the results too.

I’ve redone the bayseq analysis using your sample. However, I only found 54 genes (FDR=0.1) are significant instead of 1115.

What value did you get when you ran:

CDPost.NBML@estProps

Also could you show the code you used, just for comparison’s sake.

> CDPost.NBML@estProps

[1] 0.94111663 0.05888337

Here is the code:

http://www.ijbcb.org/people/davo/bayseq.r

I am frist time to use the DESeq. I have some data from fungi.I have some question for the soft.

1) The data I have contained one wild type and seven mutant type, I want to know if I want to analyse

the different express between two kind of mutant, should I put in all data of eight kinds in one time and count two mutant number to analyse or just pick up the two kinds mutant I want to analyse ? And the all the data have no repetition,should I repeat the work again to get more data?

2) I have use the DESeq to count the sample file, but I cannot get the same result of the paper ‘Analysing RNA-Seq data with the “DESeq” package’. And I found the Padj result of commond :

‘res <- nbinomTest( cds, 'N', 'T' )'

'res <- nbinomTest( cds, 'T', 'N' )'

is different. So ,I want to know should I putin the data at some kind of order?

I can not put my result as a accessories to give you, if you need i can send you by Email.

Hi Pei,

Firstly I highly suggest that you email the Bioconductor mailing list (Bioconductor@stat.math.ethz.ch) with your question, where I’m sure the authors of DESeq and/or other more experienced people will assist with your question. In the time being, here are some general comments.

1) While you can proceed with your analysis without replicates using DESeq, you would get a better estimation of the biological variance with more replicates. If you have the budget and time for making additional replicates, I would suggest that you do. My impression is that if you want to test between two kinds of mutants, then you should only load those into R.

2) The first thing to check is whether you are using the latest version of R and Bioconductor and whether you are following the vignette verbatim. Again I suggest you email the Bioconductor mailing list, with the output from sessionInfo().

Hope that helps,

Dave

Thank you for your help!

In fact i had wrote a letter to the auther, maybe they can not receive it for some reason. And i have do the work for 1month to get a result and discuss with you.

You are just right! Only the data input in two kind can be picked out the different express gene data. when all the kind input the result is so few to use. ten thound gene just have less ten different express.

I have upgrade the R and the DESeq and the count the sample file again in different input order. I found the some data is still different, but the DE gene is same.

My experience is that BaySeq gives the most reliable results (compared with DESEQ, DGESEQ and EdgeR), and is also the most flexible methodology when considering the design of the tests (i.e. you can for example also include technical replicates).

Major disadvantages are the computational complexity (takes a lot of time), the fact that the results are very conservative, and the difficulty to get reviewers to understand the concept of likelihood instead of p-value 😉 However, it should be mentioned that there is also a FDR included in BaySeq (later versions), and that in the end you’ll need validation anyway, so if the top-results are indeed the best ones, who cares about a too conservative (i.e. not-significant) test.

Hi Tim, thanks for the comment. I had previously spoken to a statistician, and he had mentioned that the concept behind baySeq was the most difficult to understand compared to DESeq and edgerR. Additionally, I like the idea of including technical replicates. In the past I had simply pooled the technical replicates together. I’ll have a look at baySeq some time soon.

And lastly, I should redo the analysis with updated packages and probably some published dataset.

Just curious as to how you judge “reliable”?

Hi, very nice workthrough for someone beginning with this sort of analysis. One thing, I think you have forgotten:

CDPost.NBML <- getLikelihoods.NB(CDP.NBML, pET = 'BIC', cl = cl)

after getting priors in baySeq (a little puzzle for us=)

Thanks for the post.

Hi Bruce,

Thanks for the comment and the code. It’s been a while since I posted this.

I will try to redo this analysis comparing the three packages with the latest copies and picking some dataset from the SRA.

Thanks again,

Dave

I have use Bayseq for RBA-seq data?but I don“t know how to get the number of DE gene,if anyone knows it ??

Hi dongyu,

Supposing you want to use a FDR threshold of 0.05, use the topCounts() function to output (and save the results into a csv file) your list of differntial expressed genes:

blah <- topCounts(CD,group="DE",FDR=0.05) write.csv(blah, file="bayseq_de.csv") Hope that helps, Dave

thank you very much for your reply?But I am still confused about the meaning of likelihood in the result ?

And as bayseq is not a significant test ?so what does the fdr mean and how to derive the fdr?

Hi dongyu,

This is directly from the vignette:

baySeq uses empirical Bayesian methods to estimate the posterior likelihoods of each set of a set of models that define patterns of differential expression for each row. This approach begins by considering a distribution for the row defined by a set of underlying parameters for which some prior distribution exists. By estimating this prior distribution from the data, we are able to assess, for a given model about the relatedness of our underlying parameters for multiple libraries, the posterior likelihood of the model.

My very limited understanding is that for a simple pairwise comparison (i.e. condition 1 vs. condition 2), you have two models. One model is that for each row (e.g. a gene), there are no differences between condition 1 and 2. The other model is that for each row, condition 1 is different from condition 2. The likelihood is a measure of how well the row fits this model.

As for calculating the FDR, I had a look at the published paper and the vignette but could not find an answer. Perhaps you could email the authors of baySeq or email the Bioconductor mailing list bioconductor (at) r-project.org.

Cheers,

Dave

Hi Dave?

I appreciate about your reply to my question?I think now I know the likelihood?

But the problem of fdr still confuses me?I will look for the answer?As I realise you must be a profession in this field?If you find it later?I hope you can contact with me when you are free?

thanks again?

dongyu?

Hi dongyu,

I am learning these methods just as you are, so I’m just a beginner. For the FDR, please email the authors of baySeq and I’m sure they can help you understand it. But of course if I can find the answer, I’ll let you know.

Good luck!

Dave

Hi Dave?

I contacte the author today?and fortunately receive a rapid reply?

FDR stands for ‘False Discovery Rate’ – this offers an alternativemethod of controling the number of genes you select. Briefly, if you

chose a cutoff of 0.1 for the FDR, this means that 10% of the genes youhave chosen are expected to be false positives.

The FDR is easily calculated from posterior likelihoods by taking thesum of (1 – likelihood) for all the selected genes?

Hope that helps?

But I don‘t really understand the calculation of fdr writen above?if you can understand it?I am appreciate to receive your reply?

yours?

Dongyu

Wow, this blog is amazing! The figures are great. I am working on a similar project to assess DESeq vs EdgeR vs super-basic stuff (e.g. a T-test), and ran across this blog. The Venn diagram is a great idea; I hadn’t even thought of using that before (I was just making ROC plots from simulated data).

Hi Alex,

Thanks for the compliment! I’m just doing very basic things (mostly just following the vignette) since I am trying to understand how the different methods work and compare against each other.

Good luck with your project!

Cheers,

Dave