Following the instructions from a previous post, I filtered the pnas_expression.txt dataset and saved the results in "pnas_expression_filtered.tsv" and then performed the differential gene expression analyses using the respective packages.
To run the Perl scripts below, just save the code into a file and name it "something.pl". Then make the file executable by running "chmod 755 something.pl". To execute the file in your current directory "./something.pl".
The input file (pnas_expression_filtered.tsv) looks like this:
ensembl_ID | lane1 | lane2 | lane3 | lane4 | lane5 | lane6 | lane8 |
ENSG00000124208 | 478 | 619 | 628 | 744 | 483 | 716 | 240 |
ENSG00000182463 | 27 | 20 | 27 | 26 | 48 | 55 | 24 |
ENSG00000124201 | 180 | 218 | 293 | 275 | 373 | 301 | 88 |
And finally please refer to the respective Bioconductor packages for more information.
sessionInfo()
sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] snow_0.3-8 baySeq_1.10.0 DESeq_1.8.0 locfit_1.5-7 [5] Biobase_2.16.0 BiocGenerics_0.2.0 edgeR_2.6.0 limma_3.12.0 loaded via a namespace (and not attached): [1] annotate_1.34.0 AnnotationDbi_1.18.0 DBI_0.2-5 [4] genefilter_1.38.0 geneplotter_1.34.0 grid_2.15.0 [7] IRanges_1.14.2 lattice_0.20-6 RColorBrewer_1.0-5 [10] RSQLite_0.11.1 splines_2.15.0 stats4_2.15.0 [13] survival_2.36-12 xtable_1.7-0
DESeq
library("DESeq") countsTable <- read.delim("pnas_expression_filtered.tsv", header=TRUE, row.names=1) conds <- c(rep("C",4),rep("T",3)) cds <- newCountDataSet (countsTable, conds) cds <- estimateSizeFactors(cds) sizeFactors(cds) cds <- estimateDispersions(cds) str(fitInfo(cds)) plotDispEsts <- function(cds){ plot(rowMeans(counts(cds, normalized=TRUE)), fitInfo(cds)$perGeneDispEsts, pch = '.', log="xy") xg <- 10^seq(-.5,5,length.out=300) lines(xg,fitInfo(cds)$dispFun(xg),col="red") } plotDispEsts(cds) res <- nbinomTest(cds,"C","T") head(res) plotDE <- function(res) plot(res$baseMean,res$log2FoldChange,log="x",pch=20,cex=.3,col=ifelse(res$padj < .1, "red","black")) plotDE(res) hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="") resSig <- res[res$padj < 0.1,] head( resSig[order(resSig$pval),]) write.table (res, file="results_deseq.txt")
For parsing the results_deseq.txt, I wrote this script, which outputs results with a adjusted p-value of 0.05 or less and sorted by the adjusted p-value:
#!/usr/bin/perl use strict; use warnings; my $usage = "Usage: $0 <deseq.results>\n"; my $infile = shift or die $usage; my %result = (); my $counter = '0'; open(IN,'<',$infile) || die "Could not open $infile: $!\n"; while(<IN>){ chomp; #"id" "baseMean" "baseMeanA" "baseMeanB" "foldChange" "log2FoldChange" "pval" "padj" #"1" "ENSG00000124208" 529.399951591274 598.268222888992 437.575589860984 0.731403696736499 -0.451260176370596 0.00157794968191143 0.0126404575296003 if (/^\"id/){ next; } my ($junk,$id,$base_mean,$base_mean_a,$base_mean_b,$fold_change,$log_2_fold_change,$pval,$padj) = split(/\s+/); next unless $padj <= 0.05; ++$counter; $result{$counter}->{'all'} = $_; $result{$counter}->{'padj'} = $padj; } close(IN); foreach my $num (sort {$result{$a}->{'padj'} <=> $result{$b}->{'padj'}} keys %result){ print $result{$num}->{'all'},"\n"; } exit(0); __END__
edgeR
library(edgeR) data <- read.table("pnas_expression_filtered.tsv", header=TRUE, row.names=1) d <- data[,1:7] rownames(d) <- row.names(data) group <- c(rep("Control",4),rep("Test",3)) d <- DGEList(counts = d, group=group) dim(d) d <- calcNormFactors(d) d d <- estimateCommonDisp(d) d$common.dispersion sqrt(d$common.dispersion) de.com <- exactTest(d) summary(decideTestsDGE(de.com,p.value=0.01)) summary(decideTestsDGE(de.com,p.value=0.05)) com = summary(decideTestsDGE(de.com,p.value=0.05)) com_total = com[1] + com[3] sink("de_common_dispersion.txt") topTags(de.com,n=com_total) sink() getPriorN(d) d <- estimateTagwiseDisp(d, prop.used=0.5, grid.length=500) png(file="tgw_disp_vs_log_conc.png", height=1200, width=1200) plot(log2(d$conc$conc.common),d$tagwise.dispersion,panel.first=grid()) abline(h=d$common.dispersion,col="dodgerblue",lwd=3) dev.off() de.tgw <- exactTest(d) topTags(de.tgw) summary(decideTestsDGE(de.tgw, p.value=0.01)) summary(decideTestsDGE(de.tgw, p.value=0.05)) bob = summary(decideTestsDGE(de.tgw,p.value=0.05)) total = bob[1] + bob[3] sink("de_tagwise_dispersion.txt") topTags(de.tgw,n=total) sink()
baySeq
library(baySeq) data <- read.delim("pnas_expression_filtered.tsv",header=T) replicates <- c(1,1,1,1,2,2,2) groups <- list(NDE = c(1,1,1,1,1,1,1),DE = c(1,1,1,1,2,2,2)) cname <- data[,1] data <- data[,-1] data = as.matrix(data) CD <- new("countData", data = data, replicates = replicates, groups = groups) CD@libsizes <- getLibsizes(CD) plotMA.CD(CD, samplesA = "1", samplesB = "2", col = c(rep("red", 100), rep("black", 900))) library(snow) cl <- makeCluster(8, "SOCK") CD@annotation <- as.data.frame(cname) CD <- getPriors.NB(CD, samplesize = 10000, estimation = "QL", cl = cl) CD <- getLikelihoods.NB(CD, pET = 'BIC', cl = cl) CD@estProps topCounts(CD, group = "DE") blah <- topCounts(CD,group="DE",FDR=0.05) write.csv(blah, file="bayseq_de.csv")
Results
Number of statistically significant (adjusted p-value of 0.05 or less) differentially expressed genes
DESeq: 2,796
edgeR (tagwise): 4,453
baySeq: 2,847
To compare results, I wrote this Perl script:
#!/usr/bin/perl #Compares two lists and reports entries in both files use strict; use warnings; my $usage = "Usage $0 <firstList> <secondList> <in or out>\n"; my $first = shift or die $usage; my $second = shift or die $usage; my $option = shift or die $usage; die $usage unless $option =~ /[inout]+/i; my %firstList = (); my %secondList = (); open (IN, $first) or die "Can't open $first\n"; while (<IN>){ chomp; $firstList{$_} = $_; } close (IN); open (IN, $second) or die "Can't open $second\n"; while (<IN>){ chomp; $secondList{$_} = $_; } close (IN); if ($option =~ /^[in]+/i){ for my $blah (keys %firstList){ if (exists $secondList{$blah}){ print "$blah is in both lists\n"; } } } elsif ($option =~ /^[out]+/i){ foreach my $first (keys %firstList){ print "<$first\n" unless exists $secondList{$first}; } foreach my $second (keys %secondList){ print ">$second\n" unless exists $firstList{$second}; } }
Store the top 100 DE genes from the respective packages:
parse_deseq.pl results_deseq.txt | cut -f2 -d ' ' | sed 's/"//g;' | head -100 > deseq.list cat de_tagwise_dispersion.txt | grep -v "Comparison" | grep -v "logFC" | awk '{print $1}' | head -100 > edger.list cat bayseq_de.csv | grep -v "cname" | cut -f2 -d',' | sed 's/"//g' | head -100 > bayseq.list
Comparing the top 100 differentially expressed genes
compare_list.pl deseq.list edger.list in | wc -l 88 compare_list.pl deseq.list bayseq.list in | wc -l 65 compare_list.pl edger.list bayseq.list in | wc -l 60
When comparing the top 100 (ranked by adjusted p-values) list from DESeq and edgeR, we see an overlap of 88 / 100 of the candidates. DESeq and baySeq had an overlap of 65 / 100 and edgeR and baySeq had an overlap of 60 / 100.
Store the top 500 DE genes from the respective packages:
parse_deseq.pl results_deseq.txt | cut -f2 -d ' ' | sed 's/"//g;' | head -500 > deseq_500.list cat de_tagwise_dispersion.txt | grep -v "Comparison" | grep -v "logFC" | awk '{print $1}' | head -500 > edger_500.list cat bayseq_de.csv | grep -v "cname" | cut -f2 -d',' | sed 's/"//g' | head -500 > bayseq_500.list
Comparing the top 500 differentially expressed genes
compare_list.pl deseq_500.list edger_500.list in | wc -l 450 compare_list.pl deseq_500.list bayseq_500.list in | wc -l 424 compare_list.pl edger_500.list bayseq_500.list in | wc -l 440
When comparing the top 500 (ranked by adjusted p-values) list from DESeq and edgeR, we see an overlap of 450 / 500 of the candidates. DESeq and baySeq had an overlap of 424 / 500 and edgeR and baySeq had an overlap of 440 / 500.
For comparison with a non-parametric method, see this post.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
This is a great source of information! Have you thought about adding this as a tutorial on the re-released Biostars forum?
Hi Ian,
I haven’t really thought about it but if I were to include it as a tutorial, I think I need to greatly expand on the statistical intrepretation of the results. And to be honest, I don’t feel qualified. But thanks for the nice compliment!
Dave
Dave, have you though of adding voom (now part of limma) to this comparison? i’m currently trying to wrangle some bacterial RNAseq data, and although I can get good separation (based on a plotMDS plot), my results are really not what I expected (ie, I cant answer the biological question i know is there in the data I have), and all packages seem to give me huge numbers of differentially expressed genes.
Hey Jason!
I have not used limma since my impression was that limma can only be used for homoscedastic data (e.g. microarrays), whereas count data is heteroscedastic. But perhaps voom deals with this, so I’ll have a look at it some time.
How consistent are your results from the various packages? I guess if there is some consistency, then perhaps they are reliable.
Since all these methods are parametric, you can try a non-parametric method such as SAMSeq (http://davetang.org/muse/2012/01/25/edger-vs-samseq/) or cuffdiff (something I intend to look at later).
Good luck!
Dave
I recently came across these posts, you have some interesting approaches (and I learned about pvclust), have you tried any PCA based approaches (something like sparse pca) for NGS (RNA seq read count) analysis ?
Hi Sudeep,
I’ve only tried traditional PCA (using prcomp() in R). I’m learning new things along the way and perhaps sometime later I may come across PCA based approaches.
Cheers,
Dave
Thanks! Found it very useful.
Thanks! Please do note that this post is getting quite old; hopefully I find some time to update it soon.
Dave, this post is very helpful, I learned much from it, though I can’t seem to get the edgeR get to work.
This line is giving me an error :
plot(log2(d$conc$conc.common),d$tagwise.dispersion,panel.first=grid())
Error in log2(d$conc$conc.common) :
Non-numeric argument to mathematical function
Someone has any idea ?
Hi Serj,
Thanks for the comment. This post is quite old; if you use the latest version of edgeR (edgeR_3.0.8), you can use the plotBCV() function to plot the tagwise dispersions against the log2-CPM.
Have a look at the Bioconductor page http://www.bioconductor.org/packages/release/bioc/html/edgeR.html and the edgeRUsersGuide.pdf file.
Hope that helps,
Dave
Hi Dave
The post is useful
But when I ran
“parse_deseq.pl results_deseq.txt | cut -f2 -d ‘ ‘ | sed ‘s/”//g;’ | head -100 > deseq.list”
I have the error,
“Use of uninitialized value $padj in numeric le (<=) at parse_deseq1.pl line 22, line 1115”
My dataset “results_deseq.txt”
“”,”id”,”baseMean”,”baseMeanA”,”baseMeanB”,”foldChange”,”log2FoldChange”,”pval”,”padj”
“1”,”FBgn0000003″,0,0,0,NA,NA,NA,NA
“2”,”FBgn0000008″,592.702193588692,603.200284272345,582.204102905039,0.965192023421152,-0.0511121020629761,0.69403015318515,1
“3”,”FBgn0000014″,96.127484977601,83.5990305258537,108.655939429348,1.29972726652309,0.378208921319469,0.221523493353752,1
I don’t why I get this bug?
Thank you!
Hello,
I think what’s causing the error is that the script is splitting the lines by spaces, but your file is comma delimited. Change line 22 to this:
my ($junk,$id,$base_mean,$base_mean_a,$base_mean_b,$fold_change,$log_2_fold_change,$pval,$padj) = split(/,/);
Cheers,
Dave
Thank you Dave. Now the error
Argument “NA” isn’t numeric in numeric le (<=) at parse_deseq1.pl line 22
Do I need to preprocess to remove these genes cos the fold change is 0.
The error is caused because the script expects the adjusted p-value to be a number, so it complains when it comes across the NA string. Add this line of code to skip the NAs:
next if $padj eq 'NA';
You can add it just above this line:
next unless $padj <= 0.05;
Sorry Still have the error
Argument “”padj”” isn’t numeric in numeric le (<=) at parse_deseq1.pl line 23, line 1.
Sorry I am not familiar with Perl.Thank you!
Third time lucky:
my ($junk,$id,$base_mean,$base_mean_a,$base_mean_b,$fold_change,$log_2_fold_change,$pval,$padj) = split(/\s+/);
next unless $padj =~ /^[0-9.]+$/;
next unless $padj <= 0.05;
It will skip lines when $padj isn't made up of digits and periods.
Thank you, working