Quantile normalisation in R

Updated 2015 January 14th to include a slide from Rafael.

From Wikipedia:

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile normalize two or more distributions to each other, without a reference distribution, sort as before, then set to the average (usually, arithmetical mean) of the distributions. So the highest value in all cases becomes the mean of the highest values, the second highest value becomes the mean of the second highest values, and so on.

Here, I follow the simple example on Wikipedia using R. Firstly, let’s create the test dataset:

df <- data.frame(one=c(5,2,3,4),
                 two=c(4,1,4,2),
                 three=c(3,4,6,8)
                 )
rownames(df) <- toupper(letters[1:4])

df
#  one two three
#A   5   4     3
#B   2   1     4
#C   3   4     6
#D   4   2     8

Determine the ranks of each column from lowest to highest:

df_rank <- apply(df,2,rank,ties.method="min")
df_rank
#  one two three
#A   4   3     1
#B   1   1     2
#C   2   3     3
#D   3   2     4

Sort the original matrix from lowest to highest:

df_sorted <- data.frame(apply(df, 2, sort))
df_sorted
#  one two three
#1   2   1     3
#2   3   2     4
#3   4   4     6
#4   5   4     8

Calculate the means:

df_mean <- apply(df_sorted, 1, mean)
df_mean
#[1] 2.000000 3.000000 4.666667 5.666667

And finally substitute the means into our ranked matrix:

index_to_mean <- function(my_index, my_mean){
  return(my_mean[my_index])
}

df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean)
rownames(df_final) <- toupper(letters[1:4])
df_final
#       one      two    three
#A 5.666667 4.666667 2.000000
#B 2.000000 2.000000 3.000000
#C 3.000000 4.666667 4.666667
#D 4.666667 3.000000 5.666667

Now to put it all together in one function called quantile_normalisation():

quantile_normalisation <- function(df){
  df_rank <- apply(df,2,rank,ties.method="min")
  df_sorted <- data.frame(apply(df, 2, sort))
  df_mean <- apply(df_sorted, 1, mean)
  
  index_to_mean <- function(my_index, my_mean){
    return(my_mean[my_index])
  }
  
  df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean)
  rownames(df_final) <- rownames(df)
  return(df_final)
}

#test the function
df <- data.frame(one=c(5,2,3,4),
                 two=c(4,1,4,2),
                 three=c(3,4,6,8)
                 )
rownames(df) <- toupper(letters[1:4])

quantile_normalisation(df)
#       one      two    three
#A 5.666667 4.666667 2.000000
#B 2.000000 2.000000 3.000000
#C 3.000000 4.666667 4.666667
#D 4.666667 3.000000 5.666667

One more example:

df <- as.data.frame(matrix(data = c(2,4,4,5,5,14,4,7,4,8,6,9,3,8,5,8,3,9,3,5), nrow = 5, byrow = TRUE))

df
  V1 V2 V3 V4
1  2  4  4  5
2  5 14  4  7
3  4  8  6  9
4  3  8  5  8
5  3  9  3  5

quantile_normalisation(df)
   V1  V2  V3  V4
1 3.5 3.5 5.0 3.5
2 8.5 8.5 5.0 5.5
3 6.5 5.0 8.5 8.5
4 5.0 5.0 6.5 6.5
5 5.0 6.5 3.5 3.5

The data for the above example was used in the PH525x course, where the process of quantile normalisation was summarised quite nicely in this slide:

(Again my values are different from those in the slides because I used the minimum value for rank() and there are several ties in the ranking. The values above were calculated using a “first” approach (see ?rank in R))

Bioconductor

The preprocessCore package on Bioconductor already has a function for quantile normalisation called normalize.quantiles. Let me illustrate its use with the same example as above:

#install if necessary
source('http://bioconductor.org/biocLite.R')
biocLite('preprocessCore')
#load package
library(preprocessCore)

#the function expects a matrix
#create a matrix using the same example
mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8),
             ncol=3)
mat
#     [,1] [,2] [,3]
#[1,]    5    4    3
#[2,]    2    1    4
#[3,]    3    4    6
#[4,]    4    2    8

#quantile normalisation
normalize.quantiles(mat)
#         [,1]     [,2]     [,3]
#[1,] 5.666667 5.166667 2.000000
#[2,] 2.000000 2.000000 3.000000
#[3,] 3.000000 5.166667 4.666667
#[4,] 4.666667 3.000000 5.666667

Note the difference in values in column two between the Wikipedia example and the normalize.quantiles() function, where the values were tied. In the example on Wikipedia, the minimum is used but in the normalize.quantiles() function, the average is used, ((4.666667 + 5.666667) / 2) = 5.166667.

Conclusions

Use the preprocessCore package and the normalize.quantiles() function for quantile normalisation in R and cite “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias”.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
6 comments Add yours
  1. How would you modify your code to quantile normalize a test data using quantiles from the training data e.g. your training data that is quantile normalized is (df_final).

    Thanks very much.

    1. Hi Angel,

      I’m not sure I understand your question. In any case, I would recommend using the preprocessCore package on Bioconductor, instead of my code, for performing quantile normalisation.

      Cheers,

      Dave

  2. Hi Dave,
    data=read.csv(“bk.txt”, sep=”\t”, header=T)
    head(data)
    X adult.endothelial.progenitor.cell alternatively.activated.macrophage
    1 ABCG4 1.17 1.00
    2 AP003391.1 1.00 1.00
    3 ATP5L 170.36 200.45
    4 BCL9L 17.52 1.74
    5 BMPR1APS2 1.04 1.05
    6 C2CD2L 4.44 11.20
    rownames(data)

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.