Updated 2015 January 14th to include a slide from Rafael.

From Wikipedia:

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile normalize two or more distributions to each other, without a reference distribution, sort as before, then set to the average (usually, arithmetical mean) of the distributions. So the highest value in all cases becomes the mean of the highest values, the second highest value becomes the mean of the second highest values, and so on.

Here, I follow the simple example on Wikipedia using R. Firstly, let’s create the test dataset:

df <- data.frame(one=c(5,2,3,4), two=c(4,1,4,2), three=c(3,4,6,8) ) rownames(df) <- toupper(letters[1:4]) df # one two three #A 5 4 3 #B 2 1 4 #C 3 4 6 #D 4 2 8

Determine the ranks of each column from lowest to highest:

df_rank <- apply(df,2,rank,ties.method="min") df_rank # one two three #A 4 3 1 #B 1 1 2 #C 2 3 3 #D 3 2 4

Sort the original matrix from lowest to highest:

df_sorted <- data.frame(apply(df, 2, sort)) df_sorted # one two three #1 2 1 3 #2 3 2 4 #3 4 4 6 #4 5 4 8

Calculate the means:

df_mean <- apply(df_sorted, 1, mean) df_mean #[1] 2.000000 3.000000 4.666667 5.666667

And finally substitute the means into our ranked matrix:

index_to_mean <- function(my_index, my_mean){ return(my_mean[my_index]) } df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean) rownames(df_final) <- toupper(letters[1:4]) df_final # one two three #A 5.666667 4.666667 2.000000 #B 2.000000 2.000000 3.000000 #C 3.000000 4.666667 4.666667 #D 4.666667 3.000000 5.666667

Now to put it all together in one function called quantile_normalisation():

quantile_normalisation <- function(df){ df_rank <- apply(df,2,rank,ties.method="min") df_sorted <- data.frame(apply(df, 2, sort)) df_mean <- apply(df_sorted, 1, mean) index_to_mean <- function(my_index, my_mean){ return(my_mean[my_index]) } df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean) rownames(df_final) <- rownames(df) return(df_final) } #test the function df <- data.frame(one=c(5,2,3,4), two=c(4,1,4,2), three=c(3,4,6,8) ) rownames(df) <- toupper(letters[1:4]) quantile_normalisation(df) # one two three #A 5.666667 4.666667 2.000000 #B 2.000000 2.000000 3.000000 #C 3.000000 4.666667 4.666667 #D 4.666667 3.000000 5.666667

One more example:

df <- as.data.frame(matrix(data = c(2,4,4,5,5,14,4,7,4,8,6,9,3,8,5,8,3,9,3,5), nrow = 5, byrow = TRUE)) df V1 V2 V3 V4 1 2 4 4 5 2 5 14 4 7 3 4 8 6 9 4 3 8 5 8 5 3 9 3 5 quantile_normalisation(df) V1 V2 V3 V4 1 3.5 3.5 5.0 3.5 2 8.5 8.5 5.0 5.5 3 6.5 5.0 8.5 8.5 4 5.0 5.0 6.5 6.5 5 5.0 6.5 3.5 3.5

The data for the above example was used in the PH525x course, where the process of quantile normalisation was summarised quite nicely in this slide:

```
```.@ewanbirney Neither corresponds to the @Bioconductor implementation of quantile norm (est 2001 and explained below) pic.twitter.com/lCyy6YyNB8

— Rafael A Irizarry (@rafalab) December 18, 2014

(Again my values are different from those in the slides because I used the minimum value for rank() and there are several ties in the ranking. The values above were calculated using a “first” approach (see ?rank in R))

### Bioconductor

The preprocessCore package on Bioconductor already has a function for quantile normalisation called normalize.quantiles. Let me illustrate its use with the same example as above:

#install if necessary source('http://bioconductor.org/biocLite.R') biocLite('preprocessCore') #load package library(preprocessCore) #the function expects a matrix #create a matrix using the same example mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8), ncol=3) mat # [,1] [,2] [,3] #[1,] 5 4 3 #[2,] 2 1 4 #[3,] 3 4 6 #[4,] 4 2 8 #quantile normalisation normalize.quantiles(mat) # [,1] [,2] [,3] #[1,] 5.666667 5.166667 2.000000 #[2,] 2.000000 2.000000 3.000000 #[3,] 3.000000 5.166667 4.666667 #[4,] 4.666667 3.000000 5.666667

Note the difference in values in column two between the Wikipedia example and the normalize.quantiles() function, where the values were tied. In the example on Wikipedia, the minimum is used but in the normalize.quantiles() function, the average is used, ((4.666667 + 5.666667) / 2) = 5.166667.

### Conclusions

Use the preprocessCore package and the normalize.quantiles() function for quantile normalisation in R and cite “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias”.

This work is licensed under a Creative Commons

Attribution 4.0 International License.

Nice post Dave! I have found it very useful!

The explanation is clear and the code is beautiful. It is helpful for me. Thank you!

Thx ! I just improve a part :

mat = apply(mat_ranked,2,function(x) df_mean[x]))

How would you modify your code to quantile normalize a test data using quantiles from the training data e.g. your training data that is quantile normalized is (df_final).

Thanks very much.

Hi Angel,

I’m not sure I understand your question. In any case, I would recommend using the preprocessCore package on Bioconductor, instead of my code, for performing quantile normalisation.

Cheers,

Dave

Hi Dave,

data=read.csv(“bk.txt”, sep=”\t”, header=T)

head(data)

X adult.endothelial.progenitor.cell alternatively.activated.macrophage

1 ABCG4 1.17 1.00

2 AP003391.1 1.00 1.00

3 ATP5L 170.36 200.45

4 BCL9L 17.52 1.74

5 BMPR1APS2 1.04 1.05

6 C2CD2L 4.44 11.20

rownames(data)