Quantile normalisation in R

Updated 2019 October 11th to explain the index_to_mean function.

From Wikipedia:

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile normalize two or more distributions to each other, without a reference distribution, sort as before, then set to the average (usually, arithmetical mean) of the distributions. So the highest value in all cases becomes the mean of the highest values, the second highest value becomes the mean of the second highest values, and so on.

Here, I follow the simple example on Wikipedia using R. Firstly, let’s create the test dataset:

df <- data.frame(one=c(5,2,3,4),
                 two=c(4,1,4,2),
                 three=c(3,4,6,8)
                 )
rownames(df) <- toupper(letters[1:4])

df
#  one two three
#A   5   4     3
#B   2   1     4
#C   3   4     6
#D   4   2     8

Determine the ranks of each column from lowest to highest:

df_rank <- apply(df,2,rank,ties.method="min")
df_rank
#  one two three
#A   4   3     1
#B   1   1     2
#C   2   3     3
#D   3   2     4

Sort the original matrix from lowest to highest:

df_sorted <- data.frame(apply(df, 2, sort))
df_sorted
#  one two three
#1   2   1     3
#2   3   2     4
#3   4   4     6
#4   5   4     8

Calculate the means:

df_mean <- apply(df_sorted, 1, mean)
df_mean
#[1] 2.000000 3.000000 4.666667 5.666667

And finally substitute the means into our ranked matrix.

index_to_mean <- function(my_index, my_mean){
  return(my_mean[my_index])
}

df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean)
rownames(df_final) <- toupper(letters[1:4])
df_final
#       one      two    three
#A 5.666667 4.666667 2.000000
#B 2.000000 2.000000 3.000000
#C 3.000000 4.666667 4.666667
#D 4.666667 3.000000 5.666667

Let me explain the index_to_mean function. We had previously ranked the unnormalised data by sample (columns) and stored the ranks in df_rank; these ranks act as an index. For example, if we look at column one of the unnormalised data, the values are 5, 2, 3, and 4. The value 2 is ranked 1st, if we go from lowest to highest, 3 is ranked 2nd and so on. Thus, the rank is 4, 1, 2, 3 for the values 5, 2, 3, and 4. In R (and other programming languages), we can use an index (or indices) to refer to specific value/s in a vector. (Just note that other programming languages may use zero-based indexing but R uses 1-based.)

my_vector <- c(10, 20, 30, 40)
my_index  <- c(4, 1, 2, 3)

my_vector[3]
# [1] 30

my_vector[my_index]
# [1] 40 10 20 30

index_to_mean <- function(my_index, my_mean){
  return(my_mean[my_index])
}

index_to_mean(my_index, my_vector)
# [1] 40 10 20 30

In the above code, we referred to different values in my_vector by using my_index, which is what index_to_mean is doing. We have a vector of means in df_mean and we want to retrieve the means according to the rank. The line df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean) just performs the above step but on a column to column basis.

Now to put everything together in one function called quantile_normalisation():

quantile_normalisation <- function(df){
  df_rank <- apply(df,2,rank,ties.method="min")
  df_sorted <- data.frame(apply(df, 2, sort))
  df_mean <- apply(df_sorted, 1, mean)
  
  index_to_mean <- function(my_index, my_mean){
    return(my_mean[my_index])
  }
  
  df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean)
  rownames(df_final) <- rownames(df)
  return(df_final)
}

#test the function
df <- data.frame(one=c(5,2,3,4),
                 two=c(4,1,4,2),
                 three=c(3,4,6,8)
                 )
rownames(df) <- toupper(letters[1:4])

quantile_normalisation(df)
#       one      two    three
#A 5.666667 4.666667 2.000000
#B 2.000000 2.000000 3.000000
#C 3.000000 4.666667 4.666667
#D 4.666667 3.000000 5.666667

One more example:

df <- as.data.frame(matrix(data = c(2,4,4,5,5,14,4,7,4,8,6,9,3,8,5,8,3,9,3,5), nrow = 5, byrow = TRUE))

df
  V1 V2 V3 V4
1  2  4  4  5
2  5 14  4  7
3  4  8  6  9
4  3  8  5  8
5  3  9  3  5

quantile_normalisation(df)
   V1  V2  V3  V4
1 3.5 3.5 5.0 3.5
2 8.5 8.5 5.0 5.5
3 6.5 5.0 8.5 8.5
4 5.0 5.0 6.5 6.5
5 5.0 6.5 3.5 3.5

The data for the above example was used in the PH525x course, where the process of quantile normalisation was summarised quite nicely in this slide:

(Again my values are different from those in the slides because I used the minimum value for rank() and there are several ties in the ranking. The values above were calculated using a “first” approach (see ?rank in R))

Bioconductor

The preprocessCore package on Bioconductor already has a function for quantile normalisation called normalize.quantiles. Let me illustrate its use with the same example as above:

#install if necessary
source('http://bioconductor.org/biocLite.R')
biocLite('preprocessCore')
#load package
library(preprocessCore)

#the function expects a matrix
#create a matrix using the same example
mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8),
             ncol=3)
mat
#     [,1] [,2] [,3]
#[1,]    5    4    3
#[2,]    2    1    4
#[3,]    3    4    6
#[4,]    4    2    8

#quantile normalisation
normalize.quantiles(mat)
#         [,1]     [,2]     [,3]
#[1,] 5.666667 5.166667 2.000000
#[2,] 2.000000 2.000000 3.000000
#[3,] 3.000000 5.166667 4.666667
#[4,] 4.666667 3.000000 5.666667

Note the difference in values in column two between the Wikipedia example and the normalize.quantiles() function, where the values were tied. In the example on Wikipedia, the minimum is used but in the normalize.quantiles() function, the average is used, ((4.666667 + 5.666667) / 2) = 5.166667.

Conclusions

Use the preprocessCore package and the normalize.quantiles() function for quantile normalisation in R and cite “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias”.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
6 comments Add yours
  1. How would you modify your code to quantile normalize a test data using quantiles from the training data e.g. your training data that is quantile normalized is (df_final).

    Thanks very much.

    1. Hi Angel,

      I’m not sure I understand your question. In any case, I would recommend using the preprocessCore package on Bioconductor, instead of my code, for performing quantile normalisation.

      Cheers,

      Dave

  2. Hi Dave,
    data=read.csv(“bk.txt”, sep=”\t”, header=T)
    head(data)
    X adult.endothelial.progenitor.cell alternatively.activated.macrophage
    1 ABCG4 1.17 1.00
    2 AP003391.1 1.00 1.00
    3 ATP5L 170.36 200.45
    4 BCL9L 17.52 1.74
    5 BMPR1APS2 1.04 1.05
    6 C2CD2L 4.44 11.20
    rownames(data)

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.