# Distance Matrix Computation

Here I demonstrate the distance matrix computations using the R function dist().

Firstly let's prepare a small dataset to work with:

# set seed to make example reproducible
set.seed(123)
test <- data.frame(x=sample(1:10000,7),
y=sample(1:10000,7),
z=sample(1:10000,7))
test
x    y    z
1 2876 8925 1030
2 7883 5514 8998
3 4089 4566 2461
4 8828 9566  421
5 9401 4532 3278
6  456 6773 9541
7 5278 5723 8891


How does this dataset look in 3 dimensional space?

s3d <- scatterplot3d(test, color=1:7, pch=19, type="p")
s3d.coords <- s3d$xyz.convert(test) text(s3d.coords$x, s3d.coords\$y, labels=row.names(test), cex=1, pos=4) We can see that points 4 and 6 are quite far away from each other.

The first distance matrix computation we'll calculate will be the Euclidean distance, since it's the easiest to understand and the default of dist(). The Euclidean distance is simply the distance one would physically measure, say with a ruler. For n-dimensions the formula for the Euclidean distance between points p and q is: $d(p,q) = d(q,p) = \sqrt{ \sum_{i=1}^{n} (p_{i} - q_{i})^2}$

# Euclidean distance in R
euclidean_distance <- function(p,q){
sqrt(sum((p - q)^2))
}

# what is the distance between 1 and 5?
euclidean_distance(1,5)
 4

euclidean_distance(c(2,2),c(5.535534,5.535534))
 5

# what about points 4 and 6 from our data frame
euclidean_distance(test[4,],test[6,])
 12691.16

# all pairwise distances in test
dist(test)
1         2         3         4         5         6
2 10009.695
3  4745.525  7617.448
4  6017.314  9532.925  7184.687
5  8180.928  5998.921  5374.569  5816.523
6  9106.296  7552.500  8258.083 12691.164 11147.209
7  8821.436  2615.560  6640.578  9955.503  7065.648  4977.618


As we can see above in the distance matrix, the distance between points 4 and 6 is 12691 (as we calculated using the formula) and is also the two furthest points (as we saw in the 3D scatterplot).

Next is the "maximum" method. If you type ?dist, you can get the description of maximum

Maximum distance between two components of x and y (supremum norm)

Honestly, I don't have any idea what that means, so let's try to find out with an example.

example_1 <- data.frame(x=c(2,6), y=c(4,6))
example_1
plot(example_1,pch=19)

# lines is in the format (x1, x2) (y1, y2)
lines(c(example_1[1,1],example_1[2,1]),c(example_1[1,2],example_1[2,2]))
lines(c(example_1[1,1],example_1[2,1]),c(example_1[1,2],example_1[1,2]))
lines(c(example_1[2,1],example_1[2,1]),c(example_1[1,2],example_1[2,2])) # Euclidean distance is the hypotenuse
dist(example_1)
1
2 4.472136

# seems to take the longest edge of the triangle
# that isn't the hypotenuse
dist(example_1, method="maximum")
1
2 4

# test with 3 coordinates
test[1:2,]
x    y    z
1 4014  689 6716
2 5702 7434 9359

dist(test[1:2,])
1
2 7438.402

dist(test[1:2,],method="maximum")
1
2 6745

# calculate all edges
straight_distance <- function(p,q){
sqrt((p-q)^2)
}

# voila
max(straight_distance(test[1,],test[2,]))
 6745

dist(test, method="maximum")
1    2    3    4    5    6
2 6745
3 6065 2503
4 6343 8986 6483
5 3328 3570 2811 9294
6 3322 5932 5252 6147 5204
7 1795 6050 5370 7380 3677 1527

# distance from 6 to 7
max(straight_distance(test[6,],test[7,]))
 1527


The Manhattan distance is given as: $\sum_{i=1}^{n} | p_{i} - q_{i}|,$

where $p = (p_{1}, p_{2},\ldots, p_{n}) \ and \ q = (q_{1}, q_{2},\ldots,q_{n})$

are vectors.

# following the definition above
manhattan_distance <- function(p,q){
sum(abs(p-q))
}

# manhattan distance from 6 to 7
manhattan_distance(test[6,],test[7,])
 2878

dist(test, method="manhattan")
1     2     3     4     5     6
2 11076
3  6528  5194
4  8072 16072 11142
5  8161  7295  7107 12775
6  4331 10405  9233 10385 10866
7  3527  7763  8385 10209  8224  2878


The Canberra distance is given as: $d(p,q) = \sum_{i=1}^{n} \frac{| p_{i} - q_{i}|} {|p_{i}| + |q_{i}| },$

where $p = (p_{1}, p_{2},\ldots, p_{n}) \ and \ q = (q_{1}, q_{2},\ldots,q_{n})$

are vectors.

canberra_distance <- function(p,q){
sum( (abs(p-q)) / (abs(p) + abs(q)) )
}

canberra_distance(test[6,],test[7,])
 0.2434398

dist(test, method="canberra")
1         2         3         4         5         6
2 1.1685091
3 0.8670958 0.4163867
4 1.4465730 1.6595871 1.4184359
5 1.1935235 0.7702962 0.6919662 1.4963355
6 0.6785588 0.9679473 0.9918153 1.4010095 1.1997547
7 0.5896678 0.7892443 0.9442152 1.3478370 1.0604159 0.2434398


The Minkowski distance is a metric on Euclidean space which can be considered as a generalisation of both the Euclidean distance and the Manhattan distance.

# Euclidean distance
dist(test[1:2,])
1
2 7438.402

# Minkowski distance with p = 2
dist(test[1:2,],method="minkowski",p=2)
1
2 7438.402

# Manhattan distance
dist(test[1:2,],method="manhattan")
1
2 11076

# Minkowski distance with p = 1
dist(test[1:2,],method="minkowski",p=1)
1
2 11076

# Minkowski distance with p = 1.5
dist(test[1:2,],method="minkowski",p=1.5)
1
2 8322.055


### Clustering based on different distant measures

The most important question for me was how the different distance calculations affected the relationships in the data.

par(mfrow=c(2,3))
plot(hclust(dist(test),method="average"))
plot(hclust(dist(test, method="maximum"),method="average"))
plot(hclust(dist(test, method="manhattan"),method="average"))
plot(hclust(dist(test, method="canberra"),method="average"))
plot(hclust(dist(test, method="minkowski", p=1.5),method="average"))
par(mfrow=c(1,1)) Using average linkage, the same relationships were conserved with each distance matrix computation method (for this random dataset).

### Correlation between distance matrices

The Mantel test performs a correlation between two distance matrices; this has been implemented in R and is part of the ade4 package.

install.packages("ade4")

# use the geneData data from Biobase
# install if necessary
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("Biobase")

data(geneData, package = "Biobase")

#transpose to calculate distances for the samples
data <- t(geneData)

# store different distance matrices
eu_dist  <- dist(data, method="euclidean")
max_dist <- dist(data, method="maximum")
man_dist <- dist(data, method="manhattan")
can_dist <- dist(data, method="canberra")
min_dist <- dist(data, method="minkowski", p=1.5)

# Euclidean vs. Manhattan
mantel.rtest(eu_dist, man_dist, nrepet=1000)
Monte-Carlo test
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)

Observation: 0.889386

Based on 1000 replicates
Simulated p-value: 0.000999001
Alternative hypothesis: greater

Std.Obs  Expectation     Variance
8.503747190 -0.002401909  0.010997716

# Canberra vs. Manhattan
mantel.rtest(can_dist, man_dist, nrepet=10000)
Monte-Carlo test
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)

Observation: 0.5779173

Based on 10000 replicates
Simulated p-value: 9.999e-05
Alternative hypothesis: greater

Std.Obs   Expectation      Variance
6.5606452470 -0.0005710482  0.0077749247

# Minkowski vs. Manhattan
mantel.rtest(min_dist, man_dist, nrepet=1000)
Monte-Carlo test
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)

Observation: 0.9629646

Based on 1000 replicates
Simulated p-value: 0.000999001
Alternative hypothesis: greater

Std.Obs  Expectation     Variance
10.051041373  0.003826956  0.009106254

# Minkowski vs. Euclidian
mantel.rtest(min_dist, eu_dist, nrepet=1000)
Monte-Carlo test
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)

Observation: 0.9758106

Based on 1000 replicates
Simulated p-value: 0.000999001
Alternative hypothesis: greater

Std.Obs  Expectation     Variance
8.163632265 -0.002258898  0.014353988


The Minkowski distance can be considered a generalisation of both the Euclidean distance and the Manhattan distance, thus we observed the high correlations (0.98 and 0.96 respectively).

par(mfrow=c(1,2))
plot(hclust(dist(data, method="minkowski", p=1.5),method="average"))
plot(hclust(dist(data, method="euclidean"),method="average")) While the Canberra distance is related to the Manhattan distance, there is no correlation between the two distance matrices.

par(mfrow=c(1,2))
plot(hclust(dist(data, method="canberra"),method="average"))
plot(hclust(dist(data, method="manhattan"),method="average")) ### Conclusions

Different distance measures may dramatically affect the relationships in your dataset. One way to comparing the correlation of distance matrices is via the Mantel test, which is implemented in R within the ade4 package.

This guide for labelling points on a 3D scatter plot (and how to make fancier 3D scatter plots).

Performing a Mantel test in R

### Negative distances

Recently (2017 November) I learned of negative distances from the paper Clustering by Passing Messages Between Data Points, specifically negative squared Euclidean distance. The R package apcluster contains the function negDistMat(), which can be used to calculate the negative squared Euclidean distance (and others).

# install if necessary
install.packages("apcluster")

library(apcluster)

ex <- matrix(c(0, 0.5, 0.8, 1, 0, 0.2, 0.5, 0.7, 0.1, 0, 1, 0.3, 1, 0.8, 0.2), 5, 3, byrow = TRUE)
ex
[,1] [,2] [,3]
[1,]  0.0  0.5  0.8
[2,]  1.0  0.0  0.2
[3,]  0.5  0.7  0.1
[4,]  0.0  1.0  0.3
[5,]  1.0  0.8  0.2

# Euclidean distance
# setting diagonal and upper triangle as TRUE as negDistMat()
dist(ex, diag = TRUE, upper = TRUE)
1         2         3         4         5
1 0.0000000 1.2688578 0.8831761 0.7071068 1.2041595
2 1.2688578 0.0000000 0.8660254 1.4177447 0.8000000
3 0.8831761 0.8660254 0.0000000 0.6164414 0.5196152
4 0.7071068 1.4177447 0.6164414 0.0000000 1.0246951
5 1.2041595 0.8000000 0.5196152 1.0246951 0.0000000

# negative Euclidean distance
negDistMat(ex)
1          2          3          4          5
1  0.0000000 -1.2688578 -0.8831761 -0.7071068 -1.2041595
2 -1.2688578  0.0000000 -0.8660254 -1.4177447 -0.8000000
3 -0.8831761 -0.8660254  0.0000000 -0.6164414 -0.5196152
4 -0.7071068 -1.4177447 -0.6164414  0.0000000 -1.0246951
5 -1.2041595 -0.8000000 -0.5196152 -1.0246951  0.0000000

# negative squared Euclidean distance
negDistMat(ex, r = 2)
1     2     3     4     5
1  0.00 -1.61 -0.78 -0.50 -1.45
2 -1.61  0.00 -0.75 -2.01 -0.64
3 -0.78 -0.75  0.00 -0.38 -0.27
4 -0.50 -2.01 -0.38  0.00 -1.05
5 -1.45 -0.64 -0.27 -1.05  0.00


As stated in the clustering paper, for points $x_i$ and $x_k$, $s(i,k) = - || x_i - x_k ||^2$

Here's the implementation in R.

my_neg_sq_euclid_dist <- function(i, k){
-( sum( (i - k)^2 ) )
}

# cross check results
negDistMat(ex, r = 2)
1     2     3     4     5
1  0.00 -1.61 -0.78 -0.50 -1.45
2 -1.61  0.00 -0.75 -2.01 -0.64
3 -0.78 -0.75  0.00 -0.38 -0.27
4 -0.50 -2.01 -0.38  0.00 -1.05
5 -1.45 -0.64 -0.27 -1.05  0.00

# 1 vs 2
my_neg_sq_euclid_dist(ex[1,], ex[2,])
 -1.61

# 4 vs 5
my_neg_sq_euclid_dist(ex[4,], ex[5,])
 -1.05 .
1. wb says:

Did you try also multidimensional scaling on dist data?

1. Davo says:

No I haven’t tried that. I’m not sure how I would interpret the results of that.

2. Heidi says:

Hi,
I was wondering if you can help me out with my situation. I am doing a Mantel test for my ecological data, I made different transects in the forest and within each transect I have the euclidean distances of my focal plants. So, all is fine, I can generate my matrices without utilizing coordinates and I am able to do a Mantel test per transect.
But, now I want to know if my whole community (all transects made) have spatial autocorrelation. I have GPS coordinates for each transect, but NOT per plant which complicates things because I cannot know the distance for each plant between transects. Therefore, I can know the distances between transects (by converting my coordenates to euclidean distances) but how can I generate distances between the focal plants?
It is very hard to explain, so I really hope I made sense. Do you know of a paper that might take me to the right direction? or do you know how to do that?

1. Davo says:

Hi Heidi,

thanks for the question but unfortunately it’s a bit outside my area of research (I’m not even familiar with what a transect is).

Good luck!

Dave

2. Jackie says:

If you have the coordinates of each plant inside a transect relative to, say, the center of the transect, and the coorindates of the centers of the transects, then you can find the coordinates of each of the plants by adding the coordinates together.

R_plant = R_transect + r_plant_in_transect.

This works as long as you have coordinates for the plants measured relative to a fixed location (not just a distance, though the coordinates could take the form of a distance + a polar angle).

1. Haydee says:

Thank you! I have fixed the problem. The huge thing is that I do not have gps coordinates for each plant.

3. Phil says:

Do you know how to access elements in an R distance matrix? I’ve been consulting the R documentation without any success. E.g., for the results of dist(test), I’d like this to work:
> dist_t = dist(test)
>dist_t
10009.695

but it doesn’t.

1. Davo says:

Try this:

as.matrix(dist_t)[1,2]

4. Yazen says:

This 3D geometric visualization really helped. Thank you very much!

This site uses Akismet to reduce spam. Learn how your comment data is processed.