The Markov Cluster (MCL) Algorithm is an unsupervised cluster algorithm for graphs based on simulation of stochastic flow in graphs. Markov clustering was the work of Stijn van Dongen and you can read his thesis on the Markov Cluster Algorithm. The work is based on the graph clustering paradigm, which postulates that natural groups in graphs (something we aim to look for) have the following property:
A random walk in G that visits a dense cluster will likely not leave the cluster until many of its vertices have been visited.
So in my limited understanding, the MCL algorithm simulates flow within a graph and promotes flow in a highly connected region and demotes otherwise, thus revealing natural groups within the graph. This animation illustrates the process.
For this post, I will create a simple dataset to test the software. To begin, download and install mcl:
wget http://micans.org/mcl/src/mcl-latest.tar.gz tar xzf mcl-latest.tar.gz cd mcl-12-068/ ./configure --prefix=/home/davetang/src/mcl-12-068/ make install cd ~/bin ln -s /home/davetang/src/mcl-12-068/bin/* .
The most basic usage of the mcl program is:
mcl <infile> --abc -o outfile
where the ABC format is a three column file with two labels (A and B) and a numerical value (C), separated by white space. A single parameter, the inflation option (-I), controls the granularity of the output clustering. In standard usage of the program this parameter is the only one that may require changing.
The output is a file with one or more lines, where each line is a cluster of tab separated labels. Here's the example from the mcl man page:
#examining the input cat infile cat hat 0.2 hat bat 0.16 bat cat 1.0 bat bit 0.125 bit fit 0.25 fit hit 0.5 hit bit 0.16 #running mcl mcl infile --abc -o outfile #examining the output cat outfile cat hat bat bit fit hit
I will create a test dataset with R:
#for visualisation purposes #install the igraph package install.packages('igraph') #load library library(igraph) #how many total edges edge <- 500 #ratio ratio <- c(0.1, 0.1, 0.2, 0.1, 0.1, 0.2, 0.2) #set seed set.seed(12345) #sample 50 letters from {a, b, c, d, e} a <- sample(x=letters[1:5], size=ratio[1] * edge, replace=T) aa <- sample(x=letters[1:5], size=ratio[1] * edge, replace=T) #sample 50 letters from {f, g, h, i, j} b <- sample(x=letters[6:10], size=ratio[2] * edge, replace=T) bb <- sample(x=letters[6:10], size=ratio[2] * edge, replace=T) #and so on c <- sample(x=letters[1:10], size=ratio[3] * edge, replace=T) cc <- sample(x=letters[1:10], size=ratio[3] * edge, replace=T) d <- sample(x=letters[17:21], size=ratio[4] * edge, replace=T) dd <- sample(x=letters[17:21], size=ratio[4] * edge, replace=T) e <- sample(x=letters[22:26], size=ratio[5] * edge, replace=T) ee <- sample(x=letters[22:26], size=ratio[5] * edge, replace=T) f <- sample(x=letters[17:26], size=ratio[6] * edge, replace=T) ff <- sample(x=letters[17:26], size=ratio[6] * edge, replace=T) g <- sample(x=letters, size=ratio[7] * edge, replace=T) gg <- sample(x=letters, size=ratio[7] * edge, replace=T) #concatenate the letters first <- c(a,b,c,d,e,f,g) second <- c(aa,bb,cc,dd,ee,ff,gg) #combine random <- cbind(first, second) head(random) # first second #[1,] "d" "e" #[2,] "e" "e" #[3,] "d" "b" #[4,] "e" "b" #[5,] "c" "d" #[6,] "a" "c" #create igraph graph random.network <- graph.data.frame(random, directed=F) #plot graph plot(random.network)
The idea of the above code was to create more connections between items sampled in the same space; for example, a and aa were sampled from the first five letters of the alphabet and therefore they will have more connections to each other. The vectors b and bb do the same thing but for the sixth to tenth letter of the alphabet. The vectors c and cc, enrich connections between the first ten letters of the alphabet. This process is repeated for the seventeenth to twenty-sixth letters of the alphabet.
There are two dense regions; connections between letters from a to j and letters from q to z. Within these two dense regions are more layers of connections between the letters a to e, f to j, q to u, and v to z.
We can also visualise the connections with a heatmap, following this awesome post:
#using the previous object dim(random) [1] 500 2 #converting into a table random_table <- table(random[,1], random[,2]) #viewing a subset of the table random_table[1:5,1:5] a b c d e a 6 9 5 4 11 b 4 4 7 5 7 c 5 2 6 9 6 d 6 7 2 9 9 e 6 6 1 8 5 #as you can see the table is not symmetric #i.e a to b is not equal to b to a #since I don't care about directionality #I will combine the two connection <- random_table[lower.tri(random_table)] + t(random_table)[lower.tri(random_table)] #create a new table with the new values #for an explanation of these steps see my recent post #http://davetang.org/muse/2014/01/22/making-symmetric-matrices-in-r/ connection_table <- matrix(rep(x=0,26*26), nrow=26, dimnames=list(letters, letters) ) connection_table[lower.tri(connection_table)] <- connection connection_table <- t(connection_table) connection_table[lower.tri(connection_table)] <- connection diag(connection_table) <- diag(random_table) #check for symmetry isSymmetric(connection_table) [1] TRUE #convert object into a molten data frame library(ggplot2) library(reshape2) connection_table_melted <- melt(connection_table) #plot qplot(x=Var1, y=Var2, data=connection_table_melted, fill=value, geom="tile")
Heatmap of the connections between the letters of the alphabet.
Now let's output the data from R for input into the mcl program:
dim(random) [1] 500 2 for_mcl <- cbind(random, score=rep(0.95,edge)) filename <- paste('for_mcl_', edge, '.tsv', sep='') write.table(for_mcl, file=filename, quote=F, sep="\t", row.names=F, col.names=F)
Let's run mcl with a set of inflation parameters using this script:
#contents of run.sh #!/bin/bash infile=$1 base=`basename $infile .tsv` for i in {1..10} do mcl $infile --abc -I $i -te 16 -o ${base}_$i.out done #running the script run.sh for_mcl_500.tsv #inflation parameter I = 1 #resulting in one group cat for_mcl_500_1.out | sed 's/\t//g' debcagjihfqtsruwzyxvnpolmk #I = 2 and 3 #resulting in two groups cat for_mcl_500_2.out | sed 's/\t//g' qtsruwzyxvnpolmk debcagjihf #I = 4 #resulting in 5 groups cat for_mcl_500_4.out | sed 's/\t//g' qtsruwzyxvnpom dbagjhf eci l k #I = 5 #many groups cat for_mcl_500_5.out| sed 's/\t//g' quwyxvn dbaghf tsrz ec j i p o l m k
As we increase the inflation parameter, we observe some of the groups that we artificially created. At I = 2 and 3, we observed the grouping of the first 10 letters of the alphabet (debcagjihf), but the grouping of the last 10 letters were combined with the 6 middle letters of the alphabet (qtsruwzyxvnpolmk). At I = 4, the letter k and l were singled out and we observe finer granularity of the first 10 letters. At I = 5, the middle letters of the alphabet (klmop) are singled out, as well as j and i.
Conclusions
Markov clustering can be used as a tool for revealing natural groups within a highly connected graph. As stated in the manual, if clustering is part of a larger workflow where it is desirable to analyse and compare different clusters, then it is a good idea to use the native mode rather than ABC mode (as I've shown in this post). For those more mathematically trained, do check out the thesis of this work and the associated publications.
If you use this software for your work, follow these instructions for citation.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
good post
Hi Dave,
This was a great heads up on beginning with MCL. Do you have any suggestions on how to view the clusters from MCL in a network form ? I read references to BioLayout Express in the FANTOM5 paper but I couldn’t understand how. Have you tried to do so ?
Thanks !
Hi Aditi,
I installed BioLayout but never really tried it. The Markov clusters don’t have any structure, i.e., they are an unstructured list, so I’m not sure what’s the best way of visualising them. I believe Kenneth Baillie was the person who was behind the network figure in the FANTOM5 paper; perhaps you could try emailing him for advice.
Cheers,
Dave
Hi Dave, how can you test whether the graph you made is percolating?
Dave – thank you for this very helpful post. Do you have any thoughts on how the algorithm performs on sparse graphs such as in citation data? Is there an upper limit on the adjacency matrix on say a 64 Mb machine? 1 million x 1 million?
I have no idea. Perhaps you can give it a go and if you have questions, try asking the mailing list https://micans.org/mcl/index.html?sec_questions.
Will do, thanks.