The Golden Rule of Bioinformatics

I'm a big fan of the book Bioinformatics Data Skills by Vince Buffalo and I highly recommend it to everyone who works in the bioinformatics field. The book introduces the reader to The Golden Rule of Bioinformatics, which is:

Never ever trust your tools (or data).

I am a strong proponent of this rule, which is why I always test new tools and explore my datasets.

Recently, I've been testing different methods and metrics for determining the ideal number of clusters in a clustering result. In this post, I go through my usual approach when trying out a new tool.

The first thing I usually do is generate a simple dataset with distinctive properties; since we're testing clustering methods, I created a dataset with five obvious clusters.

# generate n points around (x, y) by adding uniform noise in [0, r);
# each cloud is therefore a small square with sides of length r
generate_pts <- function(x, y, r, n){
  x1 <- x + runif(n = n, max = r)
  y1 <- y + runif(n = n, max = r)
  data.frame(x = x1, y = y1)
}

my_radius <- 2
my_number <- 20
my_x      <- 10
my_y      <- 10

set.seed(1984)
my_top_right    <- generate_pts(x = my_x, y = my_y, r = my_radius, n = my_number)
my_top_left     <- generate_pts(x = -my_x, y = my_y, r = my_radius, n = my_number)
my_bottom_right <- generate_pts(x = my_x, y = -my_y, r = my_radius, n = my_number)
my_bottom_left  <- generate_pts(x = -my_x, y = -my_y, r = my_radius, n = my_number)
my_middle       <- generate_pts(x = 0, y = 0, r = my_radius, n = my_number)

my_df <- rbind(my_top_right, my_top_left, my_bottom_right, my_bottom_left, my_middle)
plot(my_df, pch = 16, col = rep(2:6, each = my_number))

Five clusters.

I found two R packages that can calculate clustering metrics: clValid and NbClust.

Install and load the packages.

my_pkg <- c("clValid", "NbClust")

# installed.packages() returns a matrix, so match against the
# package names in its rownames rather than the whole matrix
for(pkg in my_pkg){
  if (! pkg %in% rownames(installed.packages())){
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

We will use the clValid package first to calculate clustering metrics for three different clustering approaches.

my_clvalid <- clValid(obj = my_df,
                      nClust = 2:10,
                      clMethods = c("hierarchical", "kmeans", "pam"),
                      validation = "internal")

summary(my_clvalid)

Clustering Methods:
 hierarchical kmeans pam 

Cluster sizes:
 2 3 4 5 6 7 8 9 10 

Validation Measures:
                                 2       3       4       5       6       7       8       9      10
                                                                                                  
hierarchical Connectivity   0.0000  0.0000  0.0000  0.0000  7.1282 15.9175 22.5294 31.2492 40.2722
             Dunn           0.5540  0.5549  0.7343  4.8416  0.2231  0.2231  0.2231  0.1954  0.2164
             Silhouette     0.4550  0.5963  0.7720  0.9217  0.8270  0.7231  0.6168  0.5082  0.4024
kmeans       Connectivity   0.0000  0.0000  0.0000  0.0000  7.1282 15.9175 24.8421 33.5619 41.6754
             Dunn           0.5540  0.5549  0.7343  4.8416  0.2231  0.2231  0.2231  0.1954  0.2164
             Silhouette     0.4550  0.5963  0.7720  0.9217  0.8270  0.7231  0.6167  0.5080  0.4090
pam          Connectivity   0.0000  0.0000 11.7115  0.0000  9.2127 19.0353 20.7353 29.3425 37.4560
             Dunn           0.3960  0.3960  0.0278  4.8416  0.1665  0.1550  0.1550  0.1559  0.1559
             Silhouette     0.4180  0.5369  0.6844  0.9217  0.8196  0.7060  0.7430  0.6460  0.5469

Optimal Scores:

             Score  Method       Clusters
Connectivity 0.0000 hierarchical 2       
Dunn         4.8416 hierarchical 5       
Silhouette   0.9217 hierarchical 5
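
As an aside, clValid also provides accessor functions, so the "Optimal Scores" table can be pulled out programmatically; this uses the package's documented optimalScores() accessor:

# programmatic access to the "Optimal Scores" table above
optimalScores(my_clvalid)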

The summary provides three different metrics for each of the three clustering approaches across a range of k values. Based on the Dunn and Silhouette indices, we should set k to five, which is what we expected (great!). We can visualise the table above.

library(dplyr)
library(ggplot2)
library(reshape2)
library(cowplot)
my_plot <- list()
# my_clvalid@measures is a 3D array: measure x number of clusters x method
for(i in my_clvalid@measNames){
  p <- as.data.frame(my_clvalid@measures[i, , ]) %>%
    tibble::rownames_to_column(var = "cluster") %>%
    mutate(cluster = factor(cluster, levels = cluster)) %>%
    melt(id.vars = "cluster") %>%
    ggplot(aes(x = cluster, y = value, colour = variable, group = variable)) +
    geom_line() +
    geom_point() +
    theme_bw() +
    ggtitle(i)
  my_plot[[i]] <- p
}

plot_grid(plotlist = my_plot, nrow = 1)

Now we'll test the NbClust package, using only the k-means algorithm and the same range of k values as clValid.

my_nbclust <- NbClust(data = my_df, method = "kmeans", min.nc = 2, max.nc = 10)

# I've excluded the plots
*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 
*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 4 proposed 2 as the best number of clusters 
* 5 proposed 3 as the best number of clusters 
* 1 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 11 proposed 6 as the best number of clusters 
* 1 proposed 9 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  6 
 
 
*******************************************************************

The conclusion is that k should be six, when I expected five. This is when I began investigating what was wrong, starting with a comparison of the Dunn indices returned by the two packages.

# clValid package
my_clvalid@measures["Dunn",,"kmeans"]
        2         3         4         5         6         7         8         9        10 
0.5540189 0.5548707 0.7342586 4.8415855 0.2230769 0.2230769 0.2230769 0.1953682 0.2163502

# NbClust package
my_nbclust$All.index[, "Dunn"]
     2      3      4      5      6      7      8      9     10 
0.5445 0.5339 0.0173 0.0322 0.2165 0.2178 0.1953 0.1675 0.1945
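
As a quick reminder, the Dunn index is the smallest distance between points in different clusters divided by the largest distance between points in the same cluster, so higher values indicate compact, well-separated clusters. Below is a minimal sketch of that definition (my_dunn is my own helper, not from either package, and assumes Euclidean distances); it should agree with clValid's dunn():

# Dunn index: min between-cluster distance / max within-cluster diameter
my_dunn <- function(df, clusters){
  d <- as.matrix(dist(df))                 # pairwise Euclidean distances
  same <- outer(clusters, clusters, "==")  # TRUE if two points share a cluster
  diag(d) <- NA                            # ignore self-distances
  min(d[!same], na.rm = TRUE) / max(d[same], na.rm = TRUE)
}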

I expected to see the highest Dunn index at k = 5, but this was not the case for the NbClust package. I then looked at the source code of both packages. As it turns out, clValid does not simply perform k-means clustering; it first runs hierarchical clustering to determine the starting points, as this excerpt from its source code shows:

# hclust is used prior to performing kmeans
         kmeans = {
           clusterObj <- vector("list",length=length(nClust))
           names(clusterObj) <- nClust
           clusterObjInit <- hclust(Dist,method)
         },
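
In other words, clValid seeds k-means with a hierarchical clustering solution instead of random centres. Conceptually, it does something like the sketch below (this is my own illustration of the idea, not clValid's exact code; hclust_seeded_kmeans is a hypothetical helper):

# cut the hierarchical tree at k and use the per-cluster means
# as the initial centres for kmeans()
hclust_seeded_kmeans <- function(df, k){
  hc      <- hclust(dist(df))
  init    <- cutree(hc, k = k)
  centers <- aggregate(df, by = list(init), FUN = mean)[, -1]
  kmeans(df, centers = centers)
}

hclust_seeded_kmeans(my_df, 5)$size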

If we examine the source code of NbClust, we see that the k-means step is carried out differently, using most of the default parameters.

                { set.seed(1)
                  pp2 <- kmeans(Xnew, ClassNr, 100)$cluster
                }

Since a seed was used, we can replicate the k-means clustering step.

set.seed(1)
my_kmeans <- kmeans(x = my_df, centers = 5, iter.max = 100)
plot(my_df, col = my_kmeans$cluster, pch = 16)
points(my_kmeans$centers, pch = 4)

Not an ideal clustering of our dataset.

In the plot, I display the centres of each cluster, which were initialised randomly and adjusted until convergence. It turns out that the properties of my dataset (four clouds of points on the extremities, all equidistant from the centre) mean that the final clustering depends on the random initialisation of the centres. (The extra hierarchical clustering step performed by clValid prior to k-means was very useful in this case.) Below I plot the results of ten k-means runs, each with a different seed.

my_plot <- list()
for(i in 1:10){
  set.seed(i)
  my_kmeans <- kmeans(x = my_df, centers = 5, iter.max = 100)
  my_tmp <- cbind(my_df, cluster = factor(my_kmeans$cluster))
  p <- ggplot() +
    geom_point(data = my_tmp, aes(x, y, colour = cluster)) +
    geom_point(data = as.data.frame(my_kmeans$centers), aes(x, y), shape = 4, size = 4) +
    theme(legend.position = "none", axis.title = element_blank())
  my_plot[[i]] <- p
}

ggsave(filename = "kmeans_seed.png",
       plot = plot_grid(plotlist = my_plot, nrow = 2),
       width = 10,
       height = 5)

Four out of ten times, k-means converged on the ideal clustering of our dataset. This is where the nstart parameter comes in handy: by default it is set to one, meaning only a single random initialisation is performed, and this is the value used by NbClust. If we set it to ten, kmeans() keeps the best of ten random starts (the run with the lowest total within-cluster sum of squares), so we should see the ideal clustering every time.

my_plot <- list()
for(i in 1:10){
  set.seed(i)
  my_kmeans <- kmeans(x = my_df, centers = 5, iter.max = 100, nstart = 10)
  my_tmp <- cbind(my_df, cluster = factor(my_kmeans$cluster))
  p <- ggplot() +
    geom_point(data = my_tmp, aes(x, y, colour = cluster)) +
    geom_point(data = as.data.frame(my_kmeans$centers), aes(x, y), shape = 4, size = 4) +
    theme(legend.position = "none", axis.title = element_blank())
  my_plot[[i]] <- p
}

ggsave(filename = "kmeans_nstart.png",
       plot = plot_grid(plotlist = my_plot, nrow = 2),
       width = 10,
       height = 5)

That's better.

Finally, if I perform the same k-means clustering as the NbClust package (by setting the seed to 1) and calculate the Dunn index with clValid's dunn() function, the values are identical to those reported by NbClust.

di <- vector()
for(k in 2:10){
  set.seed(1)
  my_kmeans <- kmeans(x = my_df, centers = k, iter.max = 100)
  # the dunn function is from the clValid package
  my_di <- dunn(clusters = my_kmeans$cluster, Data = my_df)
  di <- append(di, round(my_di, 4))
}

my_nbclust$All.index[, "Dunn"]
     2      3      4      5      6      7      8      9     10 
0.5445 0.5339 0.0173 0.0322 0.2165 0.2178 0.1953 0.1675 0.1945

names(di) <- 2:10
di
     2      3      4      5      6      7      8      9     10 
0.5445 0.5339 0.0173 0.0322 0.2165 0.2178 0.1953 0.1675 0.1945
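
We can confirm the match programmatically; both vectors are named by k and rounded to four decimal places, so this should return TRUE:

all.equal(my_nbclust$All.index[, "Dunn"], di)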

Summary

I like to test new tools for two main reasons:

  1. I want to see if they work as expected
  2. I want to learn more about how they work

Typically, I generate a simple dataset to use as input and compare the tool's output with my expectations. In this post, the NbClust results were not consistent with my expectations: it reported six clusters as optimal when I expected five. This led me to discover that the clValid function does not simply carry out k-means clustering but adds a hierarchical clustering step beforehand (which is stated in the article describing clValid). This additional step was quite useful, as illustrated by the NbClust results, where the points were clustered sub-optimally.

To conclude, I firmly believe that we should always scrutinise the tools that we use because they may not work in the way we expect.



