Updated 2019 October 11th to explain the **index_to_mean** function.

From Wikipedia:

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile normalize two or more distributions to each other, without a reference distribution, sort as before, then set to the average (usually, arithmetical mean) of the distributions. So the highest value in all cases becomes the mean of the highest values, the second highest value becomes the mean of the second highest values, and so on.

Here, I follow the simple example on Wikipedia using R. Firstly, let’s create the test dataset:

df <- data.frame(one=c(5,2,3,4), two=c(4,1,4,2), three=c(3,4,6,8) ) rownames(df) <- toupper(letters[1:4]) df # one two three #A 5 4 3 #B 2 1 4 #C 3 4 6 #D 4 2 8

Determine the ranks of each column from lowest to highest:

df_rank <- apply(df,2,rank,ties.method="min") df_rank # one two three #A 4 3 1 #B 1 1 2 #C 2 3 3 #D 3 2 4

Sort the original matrix from lowest to highest:

df_sorted <- data.frame(apply(df, 2, sort)) df_sorted # one two three #1 2 1 3 #2 3 2 4 #3 4 4 6 #4 5 4 8

Calculate the means:

df_mean <- apply(df_sorted, 1, mean) df_mean #[1] 2.000000 3.000000 4.666667 5.666667

And finally substitute the means into our ranked matrix.

index_to_mean <- function(my_index, my_mean){ return(my_mean[my_index]) } df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean) rownames(df_final) <- toupper(letters[1:4]) df_final # one two three #A 5.666667 4.666667 2.000000 #B 2.000000 2.000000 3.000000 #C 3.000000 4.666667 4.666667 #D 4.666667 3.000000 5.666667

Let me explain the **index_to_mean** function. We had previously ranked the unnormalised data by sample (columns) and stored the ranks in **df_rank**; these ranks act as an index. For example, if we look at column one of the unnormalised data, the values are 5, 2, 3, and 4. The value 2 is ranked 1st, if we go from lowest to highest, 3 is ranked 2nd and so on. Thus, the rank is 4, 1, 2, 3 for the values 5, 2, 3, and 4. In R (and other programming languages), we can use an index (or indices) to refer to specific value/s in a vector. (Just note that other programming languages may use zero-based indexing but R uses 1-based.)

my_vector <- c(10, 20, 30, 40) my_index <- c(4, 1, 2, 3) my_vector[3] # [1] 30 my_vector[my_index] # [1] 40 10 20 30 index_to_mean <- function(my_index, my_mean){ return(my_mean[my_index]) } index_to_mean(my_index, my_vector) # [1] 40 10 20 30

In the above code, we referred to different values in **my_vector** by using **my_index**, which is what **index_to_mean** is doing. We have a vector of means in **df_mean** and we want to retrieve the means according to the rank. The line **df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean)** just performs the above step but on a column to column basis.

Now to put everything together in one function called quantile_normalisation():

quantile_normalisation <- function(df){ df_rank <- apply(df,2,rank,ties.method="min") df_sorted <- data.frame(apply(df, 2, sort)) df_mean <- apply(df_sorted, 1, mean) index_to_mean <- function(my_index, my_mean){ return(my_mean[my_index]) } df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean) rownames(df_final) <- rownames(df) return(df_final) } #test the function df <- data.frame(one=c(5,2,3,4), two=c(4,1,4,2), three=c(3,4,6,8) ) rownames(df) <- toupper(letters[1:4]) quantile_normalisation(df) # one two three #A 5.666667 4.666667 2.000000 #B 2.000000 2.000000 3.000000 #C 3.000000 4.666667 4.666667 #D 4.666667 3.000000 5.666667

One more example:

df <- as.data.frame(matrix(data = c(2,4,4,5,5,14,4,7,4,8,6,9,3,8,5,8,3,9,3,5), nrow = 5, byrow = TRUE)) df V1 V2 V3 V4 1 2 4 4 5 2 5 14 4 7 3 4 8 6 9 4 3 8 5 8 5 3 9 3 5 quantile_normalisation(df) V1 V2 V3 V4 1 3.5 3.5 5.0 3.5 2 8.5 8.5 5.0 5.5 3 6.5 5.0 8.5 8.5 4 5.0 5.0 6.5 6.5 5 5.0 6.5 3.5 3.5

The data for the above example was used in the PH525x course, where the process of quantile normalisation was summarised quite nicely in this slide:

```
```.@ewanbirney Neither corresponds to the @Bioconductor implementation of quantile norm (est 2001 and explained below) pic.twitter.com/lCyy6YyNB8

— Rafael A Irizarry (@rafalab) December 18, 2014

(Again my values are different from those in the slides because I used the minimum value for rank() and there are several ties in the ranking. The values above were calculated using a “first” approach (see ?rank in R))

### Bioconductor

The preprocessCore package on Bioconductor already has a function for quantile normalisation called normalize.quantiles. Let me illustrate its use with the same example as above:

#install if necessary source('http://bioconductor.org/biocLite.R') biocLite('preprocessCore') #load package library(preprocessCore) #the function expects a matrix #create a matrix using the same example mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8), ncol=3) mat # [,1] [,2] [,3] #[1,] 5 4 3 #[2,] 2 1 4 #[3,] 3 4 6 #[4,] 4 2 8 #quantile normalisation normalize.quantiles(mat) # [,1] [,2] [,3] #[1,] 5.666667 5.166667 2.000000 #[2,] 2.000000 2.000000 3.000000 #[3,] 3.000000 5.166667 4.666667 #[4,] 4.666667 3.000000 5.666667

Note the difference in values in column two between the Wikipedia example and the normalize.quantiles() function, where the values were tied. In the example on Wikipedia, the minimum is used but in the normalize.quantiles() function, the average is used, ((4.666667 + 5.666667) / 2) = 5.166667.

### Conclusions

Use the preprocessCore package and the normalize.quantiles() function for quantile normalisation in R and cite “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias”.

This work is licensed under a Creative Commons

Attribution 4.0 International License.

Nice post Dave! I have found it very useful!

The explanation is clear and the code is beautiful. It is helpful for me. Thank you!

Thx ! I just improve a part :

mat = apply(mat_ranked,2,function(x) df_mean[x]))

How would you modify your code to quantile normalize a test data using quantiles from the training data e.g. your training data that is quantile normalized is (df_final).

Thanks very much.

Hi Angel,

I’m not sure I understand your question. In any case, I would recommend using the preprocessCore package on Bioconductor, instead of my code, for performing quantile normalisation.

Cheers,

Dave

Hi Dave,

data=read.csv(“bk.txt”, sep=”\t”, header=T)

head(data)

X adult.endothelial.progenitor.cell alternatively.activated.macrophage

1 ABCG4 1.17 1.00

2 AP003391.1 1.00 1.00

3 ATP5L 170.36 200.45

4 BCL9L 17.52 1.74

5 BMPR1APS2 1.04 1.05

6 C2CD2L 4.44 11.20

rownames(data)