A quick and short post on parallel distance calculation in R using the mclapply() function from the parallel package. I’ll use data from the Biobase and datamicroarray packages to illustrate.

Install the Biobase package and load the library.

source("http://bioconductor.org/biocLite.R") biocLite("Biobase") library(Biobase)

We’ll use the geneData data set first.

data(geneData, package = "Biobase") dim(geneData) [1] 500 26

How many row by row comparisons?

choose(500,2) [1] 124750

Use the combn() function to generate all pairwise combinations.

my_pair <- combn(1:nrow(geneData), 2, simplify = FALSE) length(my_pair) [1] 124750 head(my_pair) [[1]] [1] 1 2 [[2]] [1] 1 3 [[3]] [1] 1 4 [[4]] [1] 1 5 [[5]] [1] 1 6 [[6]] [1] 1 7

Function to calculate Euclidean distance using a list of indexes.

my_dist <- function(pair, data){ sqrt(sum( (data[pair[1],] - data[pair[2],])^2 )) }

Calculate distances and create distance matrix.

system.time(out <- lapply(my_pair, my_dist, data = geneData)) user system elapsed 0.684 0.000 0.683 # install if necessary # install.packages('parallel') library(parallel) # doesn't save you that much time, but it's a small data set system.time(out2 <- mclapply(my_pair, my_dist, data = geneData, mc.cores = 4)) user system elapsed 1.296 0.752 0.490 length(out2) [1] 124750 # check if they are identical identical(out, out2) [1] TRUE # now to store as a matrix my_matrix <- diag(nrow(geneData)) diag(my_matrix) <- 0 my_matrix[lower.tri(my_matrix)] <- unlist(out2) # this ugly looking bit is due to the matrix filling up by columns instead of rows my_matrix[upper.tri(my_matrix)] <- t(my_matrix)[upper.tri(t(my_matrix))] row.names(my_matrix) <- row.names(geneData) colnames(my_matrix) <- row.names(geneData) # easy way using_dist <- as.matrix(dist(geneData)) my_matrix[1:6, 1:6] AFFX-MurIL2_at AFFX-MurIL10_at AFFX-MurIL4_at AFFX-MurFAS_at AFFX-BioB-5_at AFFX-BioB-M_at AFFX-MurIL2_at 0.0000 296.0283 508.18259 568.03033 351.10147 339.01319 AFFX-MurIL10_at 296.0283 0.0000 333.56006 384.80776 216.84989 204.14356 AFFX-MurIL4_at 508.1826 333.5601 0.00000 80.26229 195.23304 200.74868 AFFX-MurFAS_at 568.0303 384.8078 80.26229 0.00000 248.00470 258.12776 AFFX-BioB-5_at 351.1015 216.8499 195.23304 248.00470 0.00000 87.06816 AFFX-BioB-M_at 339.0132 204.1436 200.74868 258.12776 87.06816 0.00000 using_dist[1:6,1:6] AFFX-MurIL2_at AFFX-MurIL10_at AFFX-MurIL4_at AFFX-MurFAS_at AFFX-BioB-5_at AFFX-BioB-M_at AFFX-MurIL2_at 0.0000 296.0283 508.18259 568.03033 351.10147 339.01319 AFFX-MurIL10_at 296.0283 0.0000 333.56006 384.80776 216.84989 204.14356 AFFX-MurIL4_at 508.1826 333.5601 0.00000 80.26229 195.23304 200.74868 AFFX-MurFAS_at 568.0303 384.8078 80.26229 0.00000 248.00470 258.12776 AFFX-BioB-5_at 351.1015 216.8499 195.23304 248.00470 0.00000 87.06816 AFFX-BioB-M_at 339.0132 204.1436 200.74868 258.12776 87.06816 0.00000 # what!? identical(my_matrix, using_dist) [1] FALSE # read http://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal all.equal(my_matrix, using_dist) [1] TRUE

Now using a bigger data set.

library(devtools) install_github('ramhiser/datamicroarray') library(datamicroarray) describe_data() author year n p K Disease 1 alon 1999 62 2000 2 Colon Cancer 2 borovecki 2005 31 22283 2 Huntington's Disease 3 burczynski 2006 127 22283 3 Crohn's Disease 4 chiaretti 2004 111 12625 2 Leukemia 5 chin 2006 118 22215 2 Breast Cancer 6 chowdary 2006 104 22283 2 Breast Cancer 7 christensen 2009 217 1413 3 N/A 8 golub 1999 72 7129 3 Leukemia 9 gordon 2002 181 12533 2 Lung Cancer 10 gravier 2010 168 2905 2 Breast Cancer 11 khan 2001 63 2308 4 SRBCT 12 nakayama 2001 105 22283 10 Sarcoma 13 pomeroy 2002 60 7128 2 CNS Tumor 14 shipp 2002 58 6817 2 Lymphoma 15 singh 2002 102 12600 2 Prostate Cancer 16 sorlie 2001 85 456 5 Breast Cancer 17 su 2002 102 5565 4 N/A 18 subramanian 2005 50 10100 2 N/A 19 sun 2006 180 54613 4 Glioma 20 tian 2003 173 12625 2 Myeloma 21 west 2001 49 7129 2 Breast Cancer 22 yeoh 2002 248 12625 6 Leukemia data('gravier', package = 'datamicroarray') dim(gravier$x) [1] 168 2905 data <- t(gravier$x) choose(nrow(data), 2) [1] 4218060 library(parallel) my_pair <- combn(1:nrow(data), 2, simplify = FALSE) my_dist <- function(pair, data){ sqrt(sum( (data[pair[1],] - data[pair[2],])^2 )) } # this is using my 5 year old desktop system.time(out <- lapply(my_pair, my_dist, data = data)) user system elapsed 63.00 0.00 62.99 system.time(out2 <- mclapply(my_pair, my_dist, data = data, mc.cores = 4)) user system elapsed 4.928 0.412 38.759 identical(out, out2) [1] TRUE my_matrix <- diag(nrow(data)) diag(my_matrix) <- 0 my_matrix[lower.tri(my_matrix)] <- unlist(out2) my_matrix[upper.tri(my_matrix)] <- t(my_matrix)[upper.tri(t(my_matrix))] row.names(my_matrix) <- row.names(data) colnames(my_matrix) <- row.names(data) using_dist <- as.matrix(dist(data)) all.equal(my_matrix, using_dist) [1] TRUE

### Summary

It was actually much quicker to use the dist() function.

system.time(using_dist <- as.matrix(dist(data))) user system elapsed 5.476 0.028 5.502

Try a bigger data set.

# this entire section was done on the server; my desktop would struggle data('shipp', package = 'datamicroarray') data <- t(shipp$x) system.time(using_dist <- as.matrix(dist(data))) user system elapsed 23.132 2.972 26.132 my_dist <- function(pair, data){ sqrt(sum( (data[pair[1],] - data[pair[2],])^2 )) } system.time(my_pair <- combn(1:nrow(data), 2, simplify = FALSE)) user system elapsed 31.516 4.540 36.097 system.time(out2 <- mclapply(my_pair, my_dist, data = data, mc.cores = 64)) user system elapsed 1042.256 1298.740 188.634 system.time(out2 <- mclapply(my_pair, my_dist, data = data, mc.cores = 16)) user system elapsed 1361.696 2260.428 111.345 my_matrix <- diag(nrow(data)) diag(my_matrix) <- 0 my_matrix[lower.tri(my_matrix)] <- unlist(out2) my_matrix[upper.tri(my_matrix)] <- t(my_matrix)[upper.tri(t(my_matrix))] row.names(my_matrix) <- row.names(data) colnames(my_matrix) <- row.names(data) all.equal(my_matrix, using_dist) [1] TRUE

In conclusion, if you want to calculate all pairwise Euclidean distances, just use the dist() function. The mclapply() function from the parallel package is faster than lapply(). Using more cores does not necessarily mean faster results; it depends on how big your data is, since time is spent splitting and joining your data set.

This work is licensed under a Creative Commons

Attribution 4.0 International License.

Thank you soo much!

Try using foreach loop construct, it could be faster than mclapply because mclapply can only use cores of 1 machine. Foreach uses doMPI and Rmpi as parallel backend to solve this problem.