# Distance Matrix Computation

Here I demonstrate the distance matrix computations using the R function dist().

Firstly let's prepare a small dataset to work with:

# set seed to make example reproducible
set.seed(123)
test <- data.frame(x=sample(1:10000,7),
y=sample(1:10000,7),
z=sample(1:10000,7))
test
x    y    z
1 2876 8925 1030
2 7883 5514 8998
3 4089 4566 2461
4 8828 9566  421
5 9401 4532 3278
6  456 6773 9541
7 5278 5723 8891


How does this dataset look in 3 dimensional space?

s3d <- scatterplot3d(test, color=1:7, pch=19, type="p")
s3d.coords <- s3d$xyz.convert(test) text(s3d.coords$x, s3d.coords$y, labels=row.names(test), cex=1, pos=4)  We can see that points 4 and 6 are quite far away from each other. The first distance matrix computation we'll calculate will be the Euclidean distance, since it's the easiest to understand and the default of dist(). The Euclidean distance is simply the distance one would physically measure, say with a ruler. For n-dimensions the formula for the Euclidean distance between points p and q is:$latex d(p,q) = d(q,p) = \sqrt{ \sum_{i=1}^{n} (p_{i} - q_{i})^2} &s=2 $# Euclidean distance in R euclidean_distance <- function(p,q){ sqrt(sum((p - q)^2)) } # what is the distance between 1 and 5? euclidean_distance(1,5) [1] 4 # what about Cartesian coordinates? euclidean_distance(c(2,2),c(5.535534,5.535534)) [1] 5 # what about points 4 and 6 from our data frame euclidean_distance(test[4,],test[6,]) [1] 12691.16 # all pairwise distances in test dist(test) 1 2 3 4 5 6 2 10009.695 3 4745.525 7617.448 4 6017.314 9532.925 7184.687 5 8180.928 5998.921 5374.569 5816.523 6 9106.296 7552.500 8258.083 12691.164 11147.209 7 8821.436 2615.560 6640.578 9955.503 7065.648 4977.618  As we can see above in the distance matrix, the distance between points 4 and 6 is 12691 (as we calculated using the formula) and is also the two furthest points (as we saw in the 3D scatterplot). Next is the "maximum" method. If you type ?dist, you can get the description of maximum Maximum distance between two components of x and y (supremum norm) Honestly, I don't have any idea what that means, so let's try to find out with an example. example_1 <- data.frame(x=c(2,6), y=c(4,6)) example_1 plot(example_1,pch=19) # lines is in the format (x1, x2) (y1, y2) lines(c(example_1[1,1],example_1[2,1]),c(example_1[1,2],example_1[2,2])) lines(c(example_1[1,1],example_1[2,1]),c(example_1[1,2],example_1[1,2])) lines(c(example_1[2,1],example_1[2,1]),c(example_1[1,2],example_1[2,2]))  # Euclidean distance is the hypotenuse dist(example_1) 1 2 4.472136 # seems to take the longest edge of the triangle # that isn't the hypotenuse dist(example_1, method="maximum") 1 2 4 # test with 3 coordinates test[1:2,] x y z 1 4014 689 6716 2 5702 7434 9359 dist(test[1:2,]) 1 2 7438.402 dist(test[1:2,],method="maximum") 1 2 6745 # calculate all edges straight_distance <- function(p,q){ sqrt((p-q)^2) } # voila max(straight_distance(test[1,],test[2,])) [1] 6745 dist(test, method="maximum") 1 2 3 4 5 6 2 6745 3 6065 2503 4 6343 8986 6483 5 3328 3570 2811 9294 6 3322 5932 5252 6147 5204 7 1795 6050 5370 7380 3677 1527 # distance from 6 to 7 max(straight_distance(test[6,],test[7,])) [1] 1527  The Manhattan distance is given as:$latex \sum_{i=1}^{n} | p_{i} - q_{i}|, &s=2 $where$latex p = (p_{1}, p_{2},\ldots, p_{n}) \ and \ q = (q_{1}, q_{2},\ldots,q_{n}) &s=2 $are vectors. # following the definition above manhattan_distance <- function(p,q){ sum(abs(p-q)) } # manhattan distance from 6 to 7 manhattan_distance(test[6,],test[7,]) [1] 2878 dist(test, method="manhattan") 1 2 3 4 5 6 2 11076 3 6528 5194 4 8072 16072 11142 5 8161 7295 7107 12775 6 4331 10405 9233 10385 10866 7 3527 7763 8385 10209 8224 2878  The Canberra distance is given as:$latex d(p,q) = \sum_{i=1}^{n} \frac{| p_{i} - q_{i}|} {|p_{i}| + |q_{i}| }, &s=2 $where$latex p = (p_{1}, p_{2},\ldots, p_{n}) \ and \ q = (q_{1}, q_{2},\ldots,q_{n}) &s=2 $are vectors. canberra_distance <- function(p,q){ sum( (abs(p-q)) / (abs(p) + abs(q)) ) } canberra_distance(test[6,],test[7,]) [1] 0.2434398 dist(test, method="canberra") 1 2 3 4 5 6 2 1.1685091 3 0.8670958 0.4163867 4 1.4465730 1.6595871 1.4184359 5 1.1935235 0.7702962 0.6919662 1.4963355 6 0.6785588 0.9679473 0.9918153 1.4010095 1.1997547 7 0.5896678 0.7892443 0.9442152 1.3478370 1.0604159 0.2434398  The Minkowski distance is a metric on Euclidean space which can be considered as a generalisation of both the Euclidean distance and the Manhattan distance. # Euclidean distance dist(test[1:2,]) 1 2 7438.402 # Minkowski distance with p = 2 dist(test[1:2,],method="minkowski",p=2) 1 2 7438.402 # Manhattan distance dist(test[1:2,],method="manhattan") 1 2 11076 # Minkowski distance with p = 1 dist(test[1:2,],method="minkowski",p=1) 1 2 11076 # Minkowski distance with p = 1.5 dist(test[1:2,],method="minkowski",p=1.5) 1 2 8322.055  ### Clustering based on different distant measures The most important question for me was how the different distance calculations affected the relationships in the data. par(mfrow=c(2,3)) plot(hclust(dist(test),method="average")) plot(hclust(dist(test, method="maximum"),method="average")) plot(hclust(dist(test, method="manhattan"),method="average")) plot(hclust(dist(test, method="canberra"),method="average")) plot(hclust(dist(test, method="minkowski", p=1.5),method="average")) par(mfrow=c(1,1))  Using average linkage, the same relationships were conserved with each distance matrix computation method (for this random dataset). ### Correlation between distance matrices The Mantel test performs a correlation between two distance matrices; this has been implemented in R and is part of the ade4 package. install.packages("ade4") library(ade4) # use the geneData data from Biobase # install if necessary if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("Biobase") # load the data data(geneData, package = "Biobase") #transpose to calculate distances for the samples data <- t(geneData) # store different distance matrices eu_dist <- dist(data, method="euclidean") max_dist <- dist(data, method="maximum") man_dist <- dist(data, method="manhattan") can_dist <- dist(data, method="canberra") min_dist <- dist(data, method="minkowski", p=1.5) # Euclidean vs. Manhattan mantel.rtest(eu_dist, man_dist, nrepet=1000) Monte-Carlo test Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet) Observation: 0.889386 Based on 1000 replicates Simulated p-value: 0.000999001 Alternative hypothesis: greater Std.Obs Expectation Variance 8.503747190 -0.002401909 0.010997716 # Canberra vs. Manhattan mantel.rtest(can_dist, man_dist, nrepet=10000) Monte-Carlo test Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet) Observation: 0.5779173 Based on 10000 replicates Simulated p-value: 9.999e-05 Alternative hypothesis: greater Std.Obs Expectation Variance 6.5606452470 -0.0005710482 0.0077749247 # Minkowski vs. Manhattan mantel.rtest(min_dist, man_dist, nrepet=1000) Monte-Carlo test Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet) Observation: 0.9629646 Based on 1000 replicates Simulated p-value: 0.000999001 Alternative hypothesis: greater Std.Obs Expectation Variance 10.051041373 0.003826956 0.009106254 # Minkowski vs. Euclidian mantel.rtest(min_dist, eu_dist, nrepet=1000) Monte-Carlo test Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet) Observation: 0.9758106 Based on 1000 replicates Simulated p-value: 0.000999001 Alternative hypothesis: greater Std.Obs Expectation Variance 8.163632265 -0.002258898 0.014353988  The Minkowski distance can be considered a generalisation of both the Euclidean distance and the Manhattan distance, thus we observed the high correlations (0.98 and 0.96 respectively). par(mfrow=c(1,2)) plot(hclust(dist(data, method="minkowski", p=1.5),method="average")) plot(hclust(dist(data, method="euclidean"),method="average"))  While the Canberra distance is related to the Manhattan distance, there is no correlation between the two distance matrices. par(mfrow=c(1,2)) plot(hclust(dist(data, method="canberra"),method="average")) plot(hclust(dist(data, method="manhattan"),method="average"))  ### Conclusions Different distance measures may dramatically affect the relationships in your dataset. One way to comparing the correlation of distance matrices is via the Mantel test, which is implemented in R within the ade4 package. ### See also This guide for labelling points on a 3D scatter plot (and how to make fancier 3D scatter plots). Performing a Mantel test in R ### Negative distances Recently (2017 November) I learned of negative distances from the paper Clustering by Passing Messages Between Data Points, specifically negative squared Euclidean distance. The R package apcluster contains the function negDistMat(), which can be used to calculate the negative squared Euclidean distance (and others). # install if necessary install.packages("apcluster") # load library library(apcluster) ex <- matrix(c(0, 0.5, 0.8, 1, 0, 0.2, 0.5, 0.7, 0.1, 0, 1, 0.3, 1, 0.8, 0.2), 5, 3, byrow = TRUE) ex [,1] [,2] [,3] [1,] 0.0 0.5 0.8 [2,] 1.0 0.0 0.2 [3,] 0.5 0.7 0.1 [4,] 0.0 1.0 0.3 [5,] 1.0 0.8 0.2 # Euclidean distance # setting diagonal and upper triangle as TRUE as negDistMat() dist(ex, diag = TRUE, upper = TRUE) 1 2 3 4 5 1 0.0000000 1.2688578 0.8831761 0.7071068 1.2041595 2 1.2688578 0.0000000 0.8660254 1.4177447 0.8000000 3 0.8831761 0.8660254 0.0000000 0.6164414 0.5196152 4 0.7071068 1.4177447 0.6164414 0.0000000 1.0246951 5 1.2041595 0.8000000 0.5196152 1.0246951 0.0000000 # negative Euclidean distance negDistMat(ex) 1 2 3 4 5 1 0.0000000 -1.2688578 -0.8831761 -0.7071068 -1.2041595 2 -1.2688578 0.0000000 -0.8660254 -1.4177447 -0.8000000 3 -0.8831761 -0.8660254 0.0000000 -0.6164414 -0.5196152 4 -0.7071068 -1.4177447 -0.6164414 0.0000000 -1.0246951 5 -1.2041595 -0.8000000 -0.5196152 -1.0246951 0.0000000 # negative squared Euclidean distance negDistMat(ex, r = 2) 1 2 3 4 5 1 0.00 -1.61 -0.78 -0.50 -1.45 2 -1.61 0.00 -0.75 -2.01 -0.64 3 -0.78 -0.75 0.00 -0.38 -0.27 4 -0.50 -2.01 -0.38 0.00 -1.05 5 -1.45 -0.64 -0.27 -1.05 0.00  As stated in the clustering paper, for points$latex x_i $and$latex x_k $,$latex s(i,k) = - || x_i - x_k ||^2 &s=2\$

Here's the implementation in R.

my_neg_sq_euclid_dist <- function(i, k){
-( sum( (i - k)^2 ) )
}

# cross check results
negDistMat(ex, r = 2)
1     2     3     4     5
1  0.00 -1.61 -0.78 -0.50 -1.45
2 -1.61  0.00 -0.75 -2.01 -0.64
3 -0.78 -0.75  0.00 -0.38 -0.27
4 -0.50 -2.01 -0.38  0.00 -1.05
5 -1.45 -0.64 -0.27 -1.05  0.00

# 1 vs 2
my_neg_sq_euclid_dist(ex[1,], ex[2,])
[1] -1.61

# 4 vs 5
my_neg_sq_euclid_dist(ex[4,], ex[5,])
[1] -1.05


.
1. wb says:

Did you try also multidimensional scaling on dist data?

1. Davo says:

No I haven’t tried that. I’m not sure how I would interpret the results of that.

2. Heidi says:

Hi,
I was wondering if you can help me out with my situation. I am doing a Mantel test for my ecological data, I made different transects in the forest and within each transect I have the euclidean distances of my focal plants. So, all is fine, I can generate my matrices without utilizing coordinates and I am able to do a Mantel test per transect.
But, now I want to know if my whole community (all transects made) have spatial autocorrelation. I have GPS coordinates for each transect, but NOT per plant which complicates things because I cannot know the distance for each plant between transects. Therefore, I can know the distances between transects (by converting my coordenates to euclidean distances) but how can I generate distances between the focal plants?
It is very hard to explain, so I really hope I made sense. Do you know of a paper that might take me to the right direction? or do you know how to do that?

1. Davo says:

Hi Heidi,

thanks for the question but unfortunately it’s a bit outside my area of research (I’m not even familiar with what a transect is).

Good luck!

Dave

2. Jackie says:

If you have the coordinates of each plant inside a transect relative to, say, the center of the transect, and the coorindates of the centers of the transects, then you can find the coordinates of each of the plants by adding the coordinates together.

R_plant = R_transect + r_plant_in_transect.

This works as long as you have coordinates for the plants measured relative to a fixed location (not just a distance, though the coordinates could take the form of a distance + a polar angle).

1. Haydee says:

Thank you! I have fixed the problem. The huge thing is that I do not have gps coordinates for each plant.

3. Phil says:

Do you know how to access elements in an R distance matrix? I’ve been consulting the R documentation without any success. E.g., for the results of dist(test), I’d like this to work:
> dist_t = dist(test)
>dist_t[1][2]
10009.695

but it doesn’t.

1. Davo says:

Try this:

as.matrix(dist_t)[1,2]

4. Yazen says:

This 3D geometric visualization really helped. Thank you very much!

This site uses Akismet to reduce spam. Learn how your comment data is processed.