Updated 2024 August 5th

edgeR carries out:

Differential expression analysis of RNA-seq expression profiles with biological replication. Implements a range of statistical methodology based on the negative binomial distributions, including empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests. As well as RNA-seq, it be applied to differential signal analysis of other types of genomic data that produce read counts, including ChIP-seq, ATAC-seq, Bisulfite-seq, SAGE and CAGE.

## Installation

Install using `BiocManager::install()`

.

```
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("edgeR")
```

## Normalize library sizes

The function `normLibSizes()`

calculates scaling factors to convert the raw library sizes for a set of sequenced samples into normalized effective library sizes. Additional information on each method is provided in the help page for the function, i.e., `?edgeR::normLibSizes`

:

This function computes scaling factors to convert observed library sizes into normalized library sizes, also called "effective library sizes". The effective library sizes for use in downstream analysis are lib.size * norm.factors where lib.size contains the original library sizes and norm.factors is the vector of scaling factors computed by this function.

The TMM method implements the trimmed mean of M-values method proposed by Robinson and Oshlack (2010). By default, the M-values are weighted according to inverse variances, as computed by the delta method for logarithms of binomial random variables. If refColumn is unspecified, then the column whose count-per-million upper quartile is closest to the mean upper quartile is set as the reference library.

The TMMwsp method stands for "TMM with singleton pairing". This is a variant of TMM that is intended to have more stable performance when the counts have a high proportion of zeros. In the TMM method, genes that have zero count in either library are ignored when comparing pairs of libraries. In the TMMwsp method, the positive counts from such genes are reused to increase the number of features by which the libraries are compared. The singleton positive counts are paired up between the libraries in decreasing order of size and then a slightly modified TMM method is applied to the re-ordered libraries. If refColumn is unspecified, then the column with largest sum of square-root counts is used as the reference library.

RLE is the scaling factor method proposed by Anders and Huber (2010). We call it "relative log expression", as median library is calculated from the geometric mean of all columns and the median ratio of each sample to the median library is taken as the scale factor.

The upperquartile method is the upper-quartile normalization method of Bullard et al (2010), in which the scale factors are calculated from the 75% quantile of the counts for each library, after removing genes that are zero in all libraries. The idea is generalized here to allow normalization by any quantile of the count distributions.

If method="none", then the normalization factors are set to 1.

For symmetry, normalization factors are adjusted to multiply to 1. Rows of object that have zero counts for all columns are removed before normalization factors are computed. The number of such rows does not affect the estimated normalization factors.

## A hypothetical situation

We will recreate a hypothetical situation described in Robinson and Oshlack Genome Biology 2010 to test the different normalisation methods.

There are four samples; columns one and two are the controls and columns three and four are the cases. The total number of transcripts across all the samples is 50. The first 25 transcripts are in all four samples in equal counts. The remaining 25 transcripts are only present in the controls and their counts are the same as the first 25 transcripts.

The four samples have exactly the same depth (500 counts). However, since the case samples have half the number of transcripts than the controls (25 versus 50), the first 25 transcripts are sequenced at twice the depth.

```
control_1 <- rep(10, 50)
control_2 <- rep(10, 50)
case_1 <- c(rep(20, 25),rep(0,25))
case_2 <- c(rep(20, 25),rep(0,25))
df <- data.frame(
cont1 = control_1,
cont2 = control_2,
case1 = case_1,
case2 = case_2
)
head(df)
```

```
cont1 cont2 case1 case2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
6 10 10 20 20
```

Tail of the dataset.

`tail(df)`

```
cont1 cont2 case1 case2
45 10 10 0 0
46 10 10 0 0
47 10 10 0 0
48 10 10 0 0
49 10 10 0 0
50 10 10 0 0
```

Equal depth.

`colSums(df)`

```
cont1 cont2 case1 case2
500 500 500 500
```

## No normalisation

In order to use {edgeR} we need to create a `DGEList`

object with the groups.

```
library(edgeR)
group <- rep(c('control', 'case'), each = 2)
d <- DGEList(counts=df, group=group)
d
```

```
An object of class "DGEList"
$counts
cont1 cont2 case1 case2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...
$samples
group lib.size norm.factors
cont1 control 500 1
cont2 control 500 1
case1 case 500 1
case2 case 500 1
```

Let's run the differential expression analysis without any normalisation step.

```
d <- estimateCommonDisp(d)
# perform the DE test
de <- exactTest(d)
# how many differentially expressed transcripts?
table(p.adjust(de$table$PValue, method="BH")<0.05)
```

```
TRUE
50
```

Without normalisation, every transcript is differentially expressed.

## TMM normalisation

From the edgeR manual:

{edgeR} is concerned with differential expression analysis rather than with the quantification of expression levels. It is concerned with relative changes in expression levels between conditions, but not directly with estimating absolute expression levels.

The

`normLibSizes()`

function normalises the library sizes in such a way to minimise the log-fold changes between the samples for most genes. The default method for computing these scale factors uses a trimmed mean of M-values (TMM) between each pair of samples. We call the product of the original library size and the scaling factor the effective library size, i.e., the normalised library size. The effective library size replaces the original library size in all downstream analyses

Let's test the weighted trimmed mean of M-values method:

```
d_tmm <- normLibSizes(d, method="TMM")
d_tmm
```

```
An object of class "DGEList"
$counts
cont1 cont2 case1 case2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...
$samples
group lib.size norm.factors
cont1 control 500 0.7071068
cont2 control 500 0.7071068
case1 case 500 1.4142136
case2 case 500 1.4142136
$common.dispersion
[1] 0.0001005378
$pseudo.counts
cont1 cont2 case1 case2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...
$pseudo.lib.size
[1] 500
$AveLogCPM
[1] 15.04175 15.04175 15.04175 15.04175 15.04175
45 more elements ...
```

The effective library size.

```
effective_lib_size <- d_tmm$samples$lib.size * d_tmm$samples$norm.factors
effective_lib_size
```

`[1] 353.5534 353.5534 707.1068 707.1068`

Counts before scaling.

`head(d$counts)`

```
cont1 cont2 case1 case2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
6 10 10 20 20
```

Scaling with the effective library size eliminates the differences in the first 25 transcripts.

```
t(t(d$counts) / effective_lib_size) |>
head()
```

```
cont1 cont2 case1 case2
1 0.02828427 0.02828427 0.02828427 0.02828427
2 0.02828427 0.02828427 0.02828427 0.02828427
3 0.02828427 0.02828427 0.02828427 0.02828427
4 0.02828427 0.02828427 0.02828427 0.02828427
5 0.02828427 0.02828427 0.02828427 0.02828427
6 0.02828427 0.02828427 0.02828427 0.02828427
```

Perform the differential expression test.

```
d_tmm <- estimateCommonDisp(d_tmm)
d_tmm <- exactTest(d_tmm)
table(p.adjust(d_tmm$table$PValue, method="BH")<0.05)
```

```
FALSE TRUE
25 25
```

Only half of the transcripts are differentially expressed (the last 25 transcripts), which is the correct conclusion.

## RLE normalisation

Let's test the relative log expression normalisation method:

```
RLE <- calcNormFactors(d, method="RLE")
RLE
```

```
An object of class "DGEList"
$counts
c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...
$samples
group lib.size norm.factors
c1 control 500 0.7071068
c2 control 500 0.7071068
p1 patient 500 1.4142136
p2 patient 500 1.4142136
```

```
RLE <- estimateCommonDisp(RLE)
RLE <- exactTest(RLE)
table(p.adjust(RLE$table$PValue, method="BH")<0.05)
```

```
FALSE TRUE
25 25
```

## Upper-quartile normalisation

Lastly let's try the upper-quartile normalisation method:

```
uq <- calcNormFactors(d, method="upperquartile")
uq
```

```
An object of class "DGEList"
$counts
c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...
$samples
group lib.size norm.factors
c1 control 500 0.7071068
c2 control 500 0.7071068
p1 patient 500 1.4142136
p2 patient 500 1.4142136
```

```
uq <- estimateCommonDisp(uq)
uq <- exactTest(uq)
table(p.adjust(uq$table$PValue, method="BH")<0.05)
```

```
FALSE TRUE
25 25
```

## Testing a real dataset

Perform differential gene expression analysis using various normalisation methods on the pnas_expression.txt dataset.

```
my_url <- "https://davetang.org/file/pnas_expression.txt"
data <- read.table(my_url, header=TRUE, sep="\t")
dim(data)
```

`[1] 37435 9`

Prepare a DGEList object.

```
d <- data[,2:8]
rownames(d) <- data[,1]
group <- c(rep("Control",4),rep("DHT",3))
d <- DGEList(counts = d, group=group)
```

Carry out differential gene expression analysis with no normalisation.

```
no_norm <- estimateCommonDisp(d)
no_norm <- exactTest(no_norm)
table(p.adjust(no_norm$table$PValue, method="BH")<0.05)
```

```
FALSE TRUE
33404 4031
```

With TMM normalisation.

```
TMM <- calcNormFactors(d, method="TMM")
TMM
```

```
An object of class "DGEList"
$counts
lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696 0 0 0 0 0 0 0
ENSG00000215700 0 0 0 0 0 0 0
ENSG00000215699 0 0 0 0 0 0 0
ENSG00000215784 0 0 0 0 0 0 0
ENSG00000212914 0 0 0 0 0 0 0
37430 more rows ...
$samples
group lib.size norm.factors
lane1 Control 978576 1.0350786
lane2 Control 1156844 1.0379515
lane3 Control 1442169 1.0287815
lane4 Control 1485604 1.0222095
lane5 DHT 1823460 0.9446243
lane6 DHT 1834335 0.9412769
lane8 DHT 681743 0.9954283
```

```
TMM <- estimateCommonDisp(TMM)
TMM <- exactTest(TMM)
table(p.adjust(TMM$table$PValue, method="BH")<0.05)
```

```
FALSE TRUE
33519 3916
```

With RLE.

```
RLE <- calcNormFactors(d, method="RLE")
RLE
```

```
An object of class "DGEList"
$counts
lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696 0 0 0 0 0 0 0
ENSG00000215700 0 0 0 0 0 0 0
ENSG00000215699 0 0 0 0 0 0 0
ENSG00000215784 0 0 0 0 0 0 0
ENSG00000212914 0 0 0 0 0 0 0
37430 more rows ...
$samples
group lib.size norm.factors
lane1 Control 978576 1.0150010
lane2 Control 1156844 1.0236675
lane3 Control 1442169 1.0345426
lane4 Control 1485604 1.0399724
lane5 DHT 1823460 0.9706692
lane6 DHT 1834335 0.9734955
lane8 DHT 681743 0.9466713
```

```
RLE <- estimateCommonDisp(RLE)
RLE <- exactTest(RLE)
table(p.adjust(RLE$table$PValue, method="BH")<0.05)
```

```
FALSE TRUE
33465 3970
```

With the upper quartile method.

```
uq <- calcNormFactors(d, method="upperquartile")
uq
```

```
An object of class "DGEList"
$counts
lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696 0 0 0 0 0 0 0
ENSG00000215700 0 0 0 0 0 0 0
ENSG00000215699 0 0 0 0 0 0 0
ENSG00000215784 0 0 0 0 0 0 0
ENSG00000212914 0 0 0 0 0 0 0
37430 more rows ...
$samples
group lib.size norm.factors
lane1 Control 978576 1.0272514
lane2 Control 1156844 1.0222982
lane3 Control 1442169 1.0250528
lane4 Control 1485604 1.0348864
lane5 DHT 1823460 0.9728534
lane6 DHT 1834335 0.9670858
lane8 DHT 681743 0.9541011
```

```
uq <- estimateCommonDisp(uq)
uq <- exactTest(uq)
table(p.adjust(uq$table$PValue, method="BH")<0.05)
```

```
FALSE TRUE
33466 3969
```

Finding the overlaps between the differential gene expression analyses.

```
library(gplots)
get_de <- function(x, pvalue){
my_i <- p.adjust(x$PValue, method="BH") < pvalue
row.names(x)[my_i]
}
my_de_no_norm <- get_de(no_norm$table, 0.05)
my_de_tmm <- get_de(TMM$table, 0.05)
my_de_rle <- get_de(RLE$table, 0.05)
my_de_uq <- get_de(uq$table, 0.05)
gplots::venn(list(no_norm = my_de_no_norm, TMM = my_de_tmm, RLE = my_de_rle, UQ = my_de_uq))
```

*There is a list of genes (n = 405) only detected as differentially expressed in the no normalisation set. Furthermore, there is a common list of genes (n = 342) detected as differentially expressed in the normalisation set*.

`gplots::venn(list(TMM = my_de_tmm, RLE = my_de_rle, UQ = my_de_uq))`

*There is a large overlap of differentially expressed genes in the different normalisation approaches*.

## Summary

The normalisation factors were quite similar between all normalisation methods, which is why the results of the differential expression were quite concordant. Most methods down sized the DHT samples with a normalisation factor of less than one to account for the larger library sizes of these samples.

This work is licensed under a Creative Commons

Attribution 4.0 International License.

Hi, Dave Tang

Would you like explain for me about TMM method?

I do not know the formula for calculating the column norm.factor?

Thanks.

This is the paper describing the TMM method https://www.ncbi.nlm.nih.gov/pubmed/20196867.

Thanks a lot. I understood this issue.

hi davo, could you please provide any script to extract only those genes which are deferentially expressed in each case i.e. 3863 in your case?

Shouldn’t counts be multiplied by norm.factors instead of being divided?

I’ve updated the post to show how the effective library size is calculated, which is via multiplication. The effective library size is used to adjust the counts, which is via division.

Hello Dave,

I will grateful if you include a demo on normalizing rnaseq counts using housekeeping genes.

Thanks