Markov clustering

The Markov Cluster (MCL) Algorithm is an unsupervised cluster algorithm for graphs based on simulation of stochastic flow in graphs. Markov clustering was the work of Stijn van Dongen and you can read his thesis on the Markov Cluster Algorithm. The work is based on the graph clustering paradigm, which postulates that natural groups in graphs (something we aim to look for) have the following property:

A random walk in G that visits a dense cluster will likely not leave the cluster until many of its vertices have been visited.

So in my limited understanding, the MCL algorithm simulates flow within a graph and promotes flow in a highly connected region and demotes otherwise, thus revealing natural groups within the graph. This animation illustrates the process.

For this post, I will create a simple dataset to test the software. To begin, download and install mcl:

tar xzf mcl-latest.tar.gz
cd mcl-12-068/
./configure --prefix=/home/davetang/src/mcl-12-068/
make install
cd ~/bin
ln -s /home/davetang/src/mcl-12-068/bin/* .

The most basic usage of the mcl program is:

mcl <infile> --abc -o outfile

where the ABC format is a three column file with two labels (A and B) and a numerical value (C), separated by white space. A single parameter, the inflation option (-I), controls the granularity of the output clustering. In standard usage of the program this parameter is the only one that may require changing.

The output is a file with one or more lines, where each line is a cluster of tab separated labels. Here’s the example from the mcl man page:

#examining the input
cat infile
cat hat  0.2
hat bat  0.16
bat cat  1.0
bat bit  0.125
bit fit  0.25
fit hit  0.5
hit bit  0.16

#running mcl
mcl infile --abc -o outfile

#examining the output
cat outfile
cat     hat     bat
bit     fit     hit

I will create a test dataset with R:

#for visualisation purposes
#install the igraph package
#load library

#how many total edges
edge <- 500

ratio <- c(0.1, 0.1, 0.2, 0.1, 0.1, 0.2, 0.2)

#set seed

#sample 50 letters from {a, b, c, d, e}
a <- sample(x=letters[1:5], size=ratio[1] * edge, replace=T)
aa <- sample(x=letters[1:5], size=ratio[1] * edge, replace=T)

#sample 50 letters from {f, g, h, i, j}
b <- sample(x=letters[6:10], size=ratio[2] * edge, replace=T)
bb <- sample(x=letters[6:10], size=ratio[2] * edge, replace=T)

#and so on
c <- sample(x=letters[1:10], size=ratio[3] * edge, replace=T)
cc <- sample(x=letters[1:10], size=ratio[3] * edge, replace=T)
d <- sample(x=letters[17:21], size=ratio[4] * edge, replace=T)
dd <- sample(x=letters[17:21], size=ratio[4] * edge, replace=T)
e <- sample(x=letters[22:26], size=ratio[5] * edge, replace=T)
ee <- sample(x=letters[22:26], size=ratio[5] * edge, replace=T)
f <- sample(x=letters[17:26], size=ratio[6] * edge, replace=T)
ff <- sample(x=letters[17:26], size=ratio[6] * edge, replace=T)
g <- sample(x=letters, size=ratio[7] * edge, replace=T)
gg <- sample(x=letters, size=ratio[7] * edge, replace=T)

#concatenate the letters
first <- c(a,b,c,d,e,f,g)
second <- c(aa,bb,cc,dd,ee,ff,gg)

random <- cbind(first, second)

#     first second
#[1,] "d"   "e"   
#[2,] "e"   "e"   
#[3,] "d"   "b"   
#[4,] "e"   "b"   
#[5,] "c"   "d"   
#[6,] "a"   "c"

#create igraph graph <-, directed=F)

#plot graph

The idea of the above code was to create more connections between items sampled in the same space; for example, a and aa were sampled from the first five letters of the alphabet and therefore they will have more connections to each other. The vectors b and bb do the same thing but for the sixth to tenth letter of the alphabet. The vectors c and cc, enrich connections between the first ten letters of the alphabet. This process is repeated for the seventeenth to twenty-sixth letters of the alphabet.

mock_networkThere are two dense regions; connections between letters from a to j and letters from q to z. Within these two dense regions are more layers of connections between the letters a to e, f to j, q to u, and v to z.

We can also visualise the connections with a heatmap, following this awesome post:

#using the previous object
[1] 500   2

#converting into a table
random_table <- table(random[,1], random[,2])

#viewing a subset of the table
    a b c d  e
  a 6 9 5 4 11
  b 4 4 7 5  7
  c 5 2 6 9  6
  d 6 7 2 9  9
  e 6 6 1 8  5

#as you can see the table is not symmetric
#i.e a to b is not equal to b to a
#since I don't care about directionality
#I will combine the two
connection <- random_table[lower.tri(random_table)] + t(random_table)[lower.tri(random_table)]

#create a new table with the new values
#for an explanation of these steps see my recent post
connection_table <- matrix(rep(x=0,26*26),
                           dimnames=list(letters, letters)

connection_table[lower.tri(connection_table)] <- connection
connection_table <- t(connection_table)
connection_table[lower.tri(connection_table)] <- connection
diag(connection_table) <- diag(random_table)

#check for symmetry
[1] TRUE

#convert object into a molten data frame
connection_table_melted <- melt(connection_table)

qplot(x=Var1, y=Var2, data=connection_table_melted, fill=value, geom="tile")

heatmap_connectionHeatmap of the connections between the letters of the alphabet.

Now let’s output the data from R for input into the mcl program:

[1] 500    2
for_mcl <- cbind(random, score=rep(0.95,edge))
filename <- paste('for_mcl_', edge, '.tsv', sep='')
write.table(for_mcl, file=filename, quote=F, sep="\t", row.names=F, col.names=F)

Let’s run mcl with a set of inflation parameters using this script:

#contents of

base=`basename $infile .tsv`

for i in {1..10}
   do mcl $infile --abc -I $i -te 16 -o ${base}_$i.out

#running the script for_mcl_500.tsv

#inflation parameter I = 1
#resulting in one group
cat for_mcl_500_1.out | sed 's/\t//g'

#I = 2 and 3
#resulting in two groups
cat for_mcl_500_2.out | sed 's/\t//g'

#I = 4
#resulting in 5 groups
cat for_mcl_500_4.out | sed 's/\t//g'

#I = 5
#many groups
cat for_mcl_500_5.out| sed 's/\t//g'

As we increase the inflation parameter, we observe some of the groups that we artificially created. At I = 2 and 3, we observed the grouping of the first 10 letters of the alphabet (debcagjihf), but the grouping of the last 10 letters were combined with the 6 middle letters of the alphabet (qtsruwzyxvnpolmk). At I = 4, the letter k and l were singled out and we observe finer granularity of the first 10 letters. At I = 5, the middle letters of the alphabet (klmop) are singled out, as well as j and i.


Markov clustering can be used as a tool for revealing natural groups within a highly connected graph. As stated in the manual, if clustering is part of a larger workflow where it is desirable to analyse and compare different clusters, then it is a good idea to use the native mode rather than ABC mode (as I’ve shown in this post). For those more mathematically trained, do check out the thesis of this work and the associated publications.

If you use this software for your work, follow these instructions for citation.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
4 comments Add yours
  1. Hi Dave,

    This was a great heads up on beginning with MCL. Do you have any suggestions on how to view the clusters from MCL in a network form ? I read references to BioLayout Express in the FANTOM5 paper but I couldn’t understand how. Have you tried to do so ?

    Thanks !

    1. Hi Aditi,

      I installed BioLayout but never really tried it. The Markov clusters don’t have any structure, i.e., they are an unstructured list, so I’m not sure what’s the best way of visualising them. I believe Kenneth Baillie was the person who was behind the network figure in the FANTOM5 paper; perhaps you could try emailing him for advice.



Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.