Distance Matrix Computation

Here I demonstrate the distance matrix computations using the R function dist().

Firstly let’s prepare a small dataset to work with:

#set seed to make example reproducible
set.seed(123)
test <- data.frame(x=sample(1:10000,7), 
                   y=sample(1:10000,7), 
                   z=sample(1:10000,7))
test
     x    y    z
1 2876 8925 1030
2 7883 5514 8998
3 4089 4566 2461
4 8828 9566  421
5 9401 4532 3278
6  456 6773 9541
7 5278 5723 8891

How does this dataset look in 3 dimensional space?

s3d <- scatterplot3d(test, color=1:7, pch=19, type="p")
s3d.coords <- s3d$xyz.convert(test)
text(s3d.coords$x, s3d.coords$y, labels=row.names(test), cex=1, pos=4)

s3dWe can see that points 4 and 6 are quite far away from each other.

The first distance matrix computation we’ll calculate will be the Euclidean distance, since it’s the easiest to understand and the default of dist(). The Euclidean distance is simply the distance one would physically measure, say with a ruler. For n-dimensions the formula for the Euclidean distance between points p and q is:

$$!d(p,q) = d(q,p) = \sqrt{ \sum_{i=1}^{n} (p_{i} – q_{i})^2}.$$

#Euclidean distance in R
euclidean_distance <- function(p,q){
   sqrt(sum((p - q)^2))
}
#what is the distance between 1 and 5?
euclidean_distance(1,5)
[1] 4
#what about Cartesian coordinates?
euclidean_distance(c(2,2),c(5.535534,5.535534))
[1] 5
#what about points 4 and 6 from our data frame
euclidean_distance(test[4,],test[6,])
[1] 12691.16
#all pairwise distances in test
dist(test)
          1         2         3         4         5         6
2 10009.695                                                  
3  4745.525  7617.448                                        
4  6017.314  9532.925  7184.687                              
5  8180.928  5998.921  5374.569  5816.523                    
6  9106.296  7552.500  8258.083 12691.164 11147.209          
7  8821.436  2615.560  6640.578  9955.503  7065.648  4977.618

As we can see above in the distance matrix, the distance between points 4 and 6 is 12691 (as we calculated using the formula) and is also the two furthest points (as we saw in the 3D scatterplot).

Next is the “maximum” method. If you type ?dist, you can get the description of maximum

Maximum distance between two components of x and y (supremum norm)

Honestly, I don’t have any idea what that means, so let’s try to find out with an example.

#example
example_1 <- data.frame(x=c(2,6), y=c(4,6))
example_1
plot(example_1,pch=19)
#lines is in the format (x1, x2) (y1, y2)
lines(c(example_1[1,1],example_1[2,1]),c(example_1[1,2],example_1[2,2]))
lines(c(example_1[1,1],example_1[2,1]),c(example_1[1,2],example_1[1,2]))
lines(c(example_1[2,1],example_1[2,1]),c(example_1[1,2],example_1[2,2]))

example_1

#Euclidean distance is the hypotenuse
dist(example_1)
         1
2 4.472136
#seems to take the longest edge of the triangle
#that isn't the hypotenuse
dist(example_1, method="maximum")
  1
2 4

#test with 3 coordinates
test[1:2,]
     x    y    z
1 4014  689 6716
2 5702 7434 9359
dist(test[1:2,])
         1
2 7438.402
dist(test[1:2,],method="maximum")
     1
2 6745
#calculate all edges
straight_distance <- function(p,q){
  sqrt((p-q)^2)
}
#voila
max(straight_distance(test[1,],test[2,]))
[1] 6745

dist(test, method="maximum")
     1    2    3    4    5    6
2 6745                         
3 6065 2503                    
4 6343 8986 6483               
5 3328 3570 2811 9294          
6 3322 5932 5252 6147 5204     
7 1795 6050 5370 7380 3677 1527

#distance from 6 to 7
max(straight_distance(test[6,],test[7,]))
[1] 1527

The Manhattan distance is given as:

$$!\sum_{i=1}^{n} | p_{i} – q_{i}|, $$

where

$$!p = (p_{1}, p_{2},\ldots, p_{n}) \ and \ q = (q_{1}, q_{2},\ldots,q_{n})$$

are vectors.

#following the definition above
manhattan_distance <- function(p,q){
  sum(abs(p-q))
}
#manhattan distance from 6 to 7
manhattan_distance(test[6,],test[7,])
[1] 2878
dist(test, method="manhattan")
      1     2     3     4     5     6
2 11076                              
3  6528  5194                        
4  8072 16072 11142                  
5  8161  7295  7107 12775            
6  4331 10405  9233 10385 10866      
7  3527  7763  8385 10209  8224  2878

The Canberra distance is given as:

$$!d(p,q) = \sum_{i=1}^{n} \frac{| p_{i} – q_{i}|} {|p_{i}| + |q_{i}| }, $$

where

$$!p = (p_{1}, p_{2},\ldots, p_{n}) \ and \ q = (q_{1}, q_{2},\ldots,q_{n})$$

are vectors.

canberra_distance <- function(p,q){
  sum( (abs(p-q)) / (abs(p) + abs(q)) )
}
canberra_distance(test[6,],test[7,])
[1] 0.2434398
dist(test, method="canberra")
          1         2         3         4         5         6
2 1.1685091                                                  
3 0.8670958 0.4163867                                        
4 1.4465730 1.6595871 1.4184359                              
5 1.1935235 0.7702962 0.6919662 1.4963355                    
6 0.6785588 0.9679473 0.9918153 1.4010095 1.1997547          
7 0.5896678 0.7892443 0.9442152 1.3478370 1.0604159 0.2434398

The Minkowski distance is a metric on Euclidean space which can be considered as a generalisation of both the Euclidean distance and the Manhattan distance.

#Euclidean distance
dist(test[1:2,])
         1
2 7438.402
#Minkowski distance with p = 2
dist(test[1:2,],method="minkowski",p=2)
         1
2 7438.402
#Manhattan distance
dist(test[1:2,],method="manhattan")
      1
2 11076
#Minkowski distance with p = 1
dist(test[1:2,],method="minkowski",p=1)
      1
2 11076
#Minkowski distance with p = 1.5
dist(test[1:2,],method="minkowski",p=1.5)
         1
2 8322.055

Clustering based on different distant measures

The most important question for me was how the different distance calculations affected the relationships in the data.

par(mfrow=c(2,3))
plot(hclust(dist(test),method="average"))
plot(hclust(dist(test, method="maximum"),method="average"))
plot(hclust(dist(test, method="manhattan"),method="average"))
plot(hclust(dist(test, method="canberra"),method="average"))
plot(hclust(dist(test, method="minkowski", p=1.5),method="average"))
par(mfrow=c(1,1))

hclust_different_distUsing average linkage, the same relationships were conserved with each distance matrix computation method (for this random dataset).

Correlation between distance matrices

The Mantel test performs a correlation between two distance matrices; this has been implemented in R and is part of the ade4 package.

install.packages("ade4")
library(ade4)

#use the geneData data from Biobase
#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("Biobase")
 
#load the data
data(geneData, package = "Biobase")

#transpose to calculate distances for the samples
data <- t(geneData)

#store different distance matrices
eu <- dist(data, method="euclidean")
max <- dist(data, method="maximum")
man <- dist(data, method="manhattan")
can <- dist(data, method="canberra")
min <- dist(data, method="minkowski", p=1.5)

#Euclidean vs. Manhattan
mantel.rtest(eu, man, nrepet=1000)
Monte-Carlo test
Observation: 0.889386 
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 1000 replicates
Simulated p-value: 0.000999001

#Canberra vs. Manhattan
mantel.rtest(can, man, nrepet=10000)
Monte-Carlo test
Observation: -0.04415717 
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 10000 replicates
Simulated p-value: 0.7544246

#Minkowski vs. Manhattan
mantel.rtest(min, man, nrepet=1000)
Monte-Carlo test
Observation: 0.9629646 
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 1000 replicates
Simulated p-value: 0.000999001

#Minkowski vs. Euclidian
mantel.rtest(min, eu, nrepet=1000)
Monte-Carlo test
Observation: 0.9758106 
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 1000 replicates
Simulated p-value: 0.000999001

The Minkowski distance can be considered a generalisation of both the Euclidean distance and the Manhattan distance, thus we observed the high correlations (0.98 and 0.96 respectively).

par(mfrow=c(1,2))
plot(hclust(dist(data, method="minkowski", p=1.5),method="average"))
plot(hclust(dist(data, method="euclidean"),method="average"))

minkowski_vs_euclidean

While the Canberra distance is related to the Manhattan distance, there is no correlation between the two distance matrices.

par(mfrow=c(1,2))
plot(hclust(dist(data, method="canberra"),method="average"))
plot(hclust(dist(data, method="manhattan"),method="average"))

canberra_vs_manhattan

Conclusions

Different distance measures may dramatically affect the relationships in your dataset. One way to comparing the correlation of distance matrices is via the Mantel test, which is implemented in R within the ade4 package.

See also

This guide for labelling points on a 3D scatter plot (and how to make fancier 3D scatter plots).

Performing a Mantel test in R

Negative distances

Recently (2017 November) I learned of negative distances from the paper Clustering by Passing Messages Between Data Points, specifically negative squared Euclidean distance. The R package apcluster contains the function negDistMat(), which can be used to calculate the negative squared Euclidean distance (and others).

# install if necessary
install.packages("apcluster")

# load library
library(apcluster)

ex <- matrix(c(0, 0.5, 0.8, 1, 0, 0.2, 0.5, 0.7, 0.1, 0, 1, 0.3, 1, 0.8, 0.2), 5, 3, byrow = TRUE)
ex
     [,1] [,2] [,3]
[1,]  0.0  0.5  0.8
[2,]  1.0  0.0  0.2
[3,]  0.5  0.7  0.1
[4,]  0.0  1.0  0.3
[5,]  1.0  0.8  0.2

# Euclidean distance
# setting diagonal and upper triangle as TRUE as negDistMat()
dist(ex, diag = TRUE, upper = TRUE)
          1         2         3         4         5
1 0.0000000 1.2688578 0.8831761 0.7071068 1.2041595
2 1.2688578 0.0000000 0.8660254 1.4177447 0.8000000
3 0.8831761 0.8660254 0.0000000 0.6164414 0.5196152
4 0.7071068 1.4177447 0.6164414 0.0000000 1.0246951
5 1.2041595 0.8000000 0.5196152 1.0246951 0.0000000

# negative Euclidean distance
negDistMat(ex)
           1          2          3          4          5
1  0.0000000 -1.2688578 -0.8831761 -0.7071068 -1.2041595
2 -1.2688578  0.0000000 -0.8660254 -1.4177447 -0.8000000
3 -0.8831761 -0.8660254  0.0000000 -0.6164414 -0.5196152
4 -0.7071068 -1.4177447 -0.6164414  0.0000000 -1.0246951
5 -1.2041595 -0.8000000 -0.5196152 -1.0246951  0.0000000

# negative squared Euclidean distance
negDistMat(ex, r = 2)
      1     2     3     4     5
1  0.00 -1.61 -0.78 -0.50 -1.45
2 -1.61  0.00 -0.75 -2.01 -0.64
3 -0.78 -0.75  0.00 -0.38 -0.27
4 -0.50 -2.01 -0.38  0.00 -1.05
5 -1.45 -0.64 -0.27 -1.05  0.00

As stated in the clustering paper, for points x_i and x_k ,

s(i,k) = - || x_i - x_k ||^2

Here’s the implementation in R.

my_neg_sq_euclid_dist <- function(i, k){
   -( sum( (i - k)^2 ) )
}

# cross check results
negDistMat(ex, r = 2)
      1     2     3     4     5
1  0.00 -1.61 -0.78 -0.50 -1.45
2 -1.61  0.00 -0.75 -2.01 -0.64
3 -0.78 -0.75  0.00 -0.38 -0.27
4 -0.50 -2.01 -0.38  0.00 -1.05
5 -1.45 -0.64 -0.27 -1.05  0.00

# 1 vs 2
my_neg_sq_euclid_dist(ex[1,], ex[2,])
[1] -1.61

# 4 vs 5
my_neg_sq_euclid_dist(ex[4,], ex[5,])
[1] -1.05
Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
9 comments Add yours
  1. Hi,
    I was wondering if you can help me out with my situation. I am doing a Mantel test for my ecological data, I made different transects in the forest and within each transect I have the euclidean distances of my focal plants. So, all is fine, I can generate my matrices without utilizing coordinates and I am able to do a Mantel test per transect.
    But, now I want to know if my whole community (all transects made) have spatial autocorrelation. I have GPS coordinates for each transect, but NOT per plant which complicates things because I cannot know the distance for each plant between transects. Therefore, I can know the distances between transects (by converting my coordenates to euclidean distances) but how can I generate distances between the focal plants?
    It is very hard to explain, so I really hope I made sense. Do you know of a paper that might take me to the right direction? or do you know how to do that?
    Thank you for your time.

    1. Hi Heidi,

      thanks for the question but unfortunately it’s a bit outside my area of research (I’m not even familiar with what a transect is).

      Good luck!

      Dave

    2. If you have the coordinates of each plant inside a transect relative to, say, the center of the transect, and the coorindates of the centers of the transects, then you can find the coordinates of each of the plants by adding the coordinates together.

      R_plant = R_transect + r_plant_in_transect.

      This works as long as you have coordinates for the plants measured relative to a fixed location (not just a distance, though the coordinates could take the form of a distance + a polar angle).

  2. Do you know how to access elements in an R distance matrix? I’ve been consulting the R documentation without any success. E.g., for the results of dist(test), I’d like this to work:
    > dist_t = dist(test)
    >dist_t[1][2]
    10009.695

    but it doesn’t.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.