Markov clustering

The Markov Cluster (MCL) Algorithm is an unsupervised cluster algorithm for graphs based on simulation of stochastic flow in graphs. Markov clustering was the work of Stijn van Dongen and you can read his thesis on the Markov Cluster Algorithm. The work is based on the graph clustering paradigm, which postulates that natural groups in graphs (something we aim to look for) have the following property:

A random walk in G that visits a dense cluster will likely not leave the cluster until many of its vertices have been visited.

So in my limited understanding, the MCL algorithm simulates flow within a graph and promotes flow in a highly connected region and demotes otherwise, thus revealing natural groups within the graph. This animation illustrates the process.

For this post, I will create a simple dataset to test the software. To begin, download and install mcl:

wget http://micans.org/mcl/src/mcl-latest.tar.gz
tar xzf mcl-latest.tar.gz
cd mcl-12-068/
./configure --prefix=/home/davetang/src/mcl-12-068/
make install
cd ~/bin
ln -s /home/davetang/src/mcl-12-068/bin/* .

The most basic usage of the mcl program is:

mcl <infile> --abc -o outfile

where the ABC format is a three column file with two labels (A and B) and a numerical value (C), separated by white space. A single parameter, the inflation option (-I), controls the granularity of the output clustering. In standard usage of the program this parameter is the only one that may require changing.

The output is a file with one or more lines, where each line is a cluster of tab separated labels. Here's the example from the mcl man page:

#examining the input
cat infile
cat hat  0.2
hat bat  0.16
bat cat  1.0
bat bit  0.125
bit fit  0.25
fit hit  0.5
hit bit  0.16

#running mcl
mcl infile --abc -o outfile

#examining the output
cat outfile
cat     hat     bat
bit     fit     hit

I will create a test dataset with R:

#for visualisation purposes
#install the igraph package
install.packages('igraph')
#load library
library(igraph)

#how many total edges
edge <- 500

#ratio
ratio <- c(0.1, 0.1, 0.2, 0.1, 0.1, 0.2, 0.2)

#set seed
set.seed(12345)

#sample 50 letters from {a, b, c, d, e}
a <- sample(x=letters[1:5], size=ratio[1] * edge, replace=T)
aa <- sample(x=letters[1:5], size=ratio[1] * edge, replace=T)

#sample 50 letters from {f, g, h, i, j}
b <- sample(x=letters[6:10], size=ratio[2] * edge, replace=T)
bb <- sample(x=letters[6:10], size=ratio[2] * edge, replace=T)

#and so on
c <- sample(x=letters[1:10], size=ratio[3] * edge, replace=T)
cc <- sample(x=letters[1:10], size=ratio[3] * edge, replace=T)
d <- sample(x=letters[17:21], size=ratio[4] * edge, replace=T)
dd <- sample(x=letters[17:21], size=ratio[4] * edge, replace=T)
e <- sample(x=letters[22:26], size=ratio[5] * edge, replace=T)
ee <- sample(x=letters[22:26], size=ratio[5] * edge, replace=T)
f <- sample(x=letters[17:26], size=ratio[6] * edge, replace=T)
ff <- sample(x=letters[17:26], size=ratio[6] * edge, replace=T)
g <- sample(x=letters, size=ratio[7] * edge, replace=T)
gg <- sample(x=letters, size=ratio[7] * edge, replace=T)

#concatenate the letters
first <- c(a,b,c,d,e,f,g)
second <- c(aa,bb,cc,dd,ee,ff,gg)

#combine
random <- cbind(first, second)

head(random)
#     first second
#[1,] "d"   "e"   
#[2,] "e"   "e"   
#[3,] "d"   "b"   
#[4,] "e"   "b"   
#[5,] "c"   "d"   
#[6,] "a"   "c"

#create igraph graph
random.network <- graph.data.frame(random, directed=F)

#plot graph
plot(random.network)

The idea of the above code was to create more connections between items sampled in the same space; for example, a and aa were sampled from the first five letters of the alphabet and therefore they will have more connections to each other. The vectors b and bb do the same thing but for the sixth to tenth letter of the alphabet. The vectors c and cc, enrich connections between the first ten letters of the alphabet. This process is repeated for the seventeenth to twenty-sixth letters of the alphabet.

mock_networkThere are two dense regions; connections between letters from a to j and letters from q to z. Within these two dense regions are more layers of connections between the letters a to e, f to j, q to u, and v to z.

We can also visualise the connections with a heatmap, following this awesome post:

#using the previous object
dim(random)
[1] 500   2

#converting into a table
random_table <- table(random[,1], random[,2])

#viewing a subset of the table
random_table[1:5,1:5]
   
    a b c d  e
  a 6 9 5 4 11
  b 4 4 7 5  7
  c 5 2 6 9  6
  d 6 7 2 9  9
  e 6 6 1 8  5

#as you can see the table is not symmetric
#i.e a to b is not equal to b to a
#since I don't care about directionality
#I will combine the two
connection <- random_table[lower.tri(random_table)] + t(random_table)[lower.tri(random_table)]

#create a new table with the new values
#for an explanation of these steps see my recent post
#http://davetang.org/muse/2014/01/22/making-symmetric-matrices-in-r/
connection_table <- matrix(rep(x=0,26*26),
                           nrow=26,
                           dimnames=list(letters, letters)
                           )

connection_table[lower.tri(connection_table)] <- connection
connection_table <- t(connection_table)
connection_table[lower.tri(connection_table)] <- connection
diag(connection_table) <- diag(random_table)

#check for symmetry
isSymmetric(connection_table)
[1] TRUE

#convert object into a molten data frame
library(ggplot2)
library(reshape2)
connection_table_melted <- melt(connection_table)

#plot
qplot(x=Var1, y=Var2, data=connection_table_melted, fill=value, geom="tile")

heatmap_connectionHeatmap of the connections between the letters of the alphabet.

Now let's output the data from R for input into the mcl program:

dim(random)
[1] 500    2
for_mcl <- cbind(random, score=rep(0.95,edge))
filename <- paste('for_mcl_', edge, '.tsv', sep='')
write.table(for_mcl, file=filename, quote=F, sep="\t", row.names=F, col.names=F)

Let's run mcl with a set of inflation parameters using this script:

#contents of run.sh
#!/bin/bash

infile=$1
base=`basename $infile .tsv`

for i in {1..10}
   do mcl $infile --abc -I $i -te 16 -o ${base}_$i.out
done

#running the script
run.sh for_mcl_500.tsv

#inflation parameter I = 1
#resulting in one group
cat for_mcl_500_1.out | sed 's/\t//g'
debcagjihfqtsruwzyxvnpolmk

#I = 2 and 3
#resulting in two groups
cat for_mcl_500_2.out | sed 's/\t//g'
qtsruwzyxvnpolmk
debcagjihf

#I = 4
#resulting in 5 groups
cat for_mcl_500_4.out | sed 's/\t//g'
qtsruwzyxvnpom
dbagjhf
eci
l
k

#I = 5
#many groups
cat for_mcl_500_5.out| sed 's/\t//g'
quwyxvn
dbaghf
tsrz
ec
j
i
p
o
l
m
k

As we increase the inflation parameter, we observe some of the groups that we artificially created. At I = 2 and 3, we observed the grouping of the first 10 letters of the alphabet (debcagjihf), but the grouping of the last 10 letters were combined with the 6 middle letters of the alphabet (qtsruwzyxvnpolmk). At I = 4, the letter k and l were singled out and we observe finer granularity of the first 10 letters. At I = 5, the middle letters of the alphabet (klmop) are singled out, as well as j and i.

Conclusions

Markov clustering can be used as a tool for revealing natural groups within a highly connected graph. As stated in the manual, if clustering is part of a larger workflow where it is desirable to analyse and compare different clusters, then it is a good idea to use the native mode rather than ABC mode (as I've shown in this post). For those more mathematically trained, do check out the thesis of this work and the associated publications.

If you use this software for your work, follow these instructions for citation.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
7 comments Add yours
  1. Hi Dave,

    This was a great heads up on beginning with MCL. Do you have any suggestions on how to view the clusters from MCL in a network form ? I read references to BioLayout Express in the FANTOM5 paper but I couldn’t understand how. Have you tried to do so ?

    Thanks !

    1. Hi Aditi,

      I installed BioLayout but never really tried it. The Markov clusters don’t have any structure, i.e., they are an unstructured list, so I’m not sure what’s the best way of visualising them. I believe Kenneth Baillie was the person who was behind the network figure in the FANTOM5 paper; perhaps you could try emailing him for advice.

      Cheers,

      Dave

  2. Dave – thank you for this very helpful post. Do you have any thoughts on how the algorithm performs on sparse graphs such as in citation data? Is there an upper limit on the adjacency matrix on say a 64 Mb machine? 1 million x 1 million?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.