Markov clustering

The Markov Cluster (MCL) Algorithm is an unsupervised cluster algorithm for graphs based on simulation of stochastic flow in graphs. Markov clustering was the work of Stijn van Dongen and you can read his thesis on the Markov Cluster Algorithm. The work is based on the graph clustering paradigm, which postulates that natural groups in graphs (something we aim to look for) have the following property:

A random walk in G that visits a dense cluster will likely not leave the cluster until many of its vertices have been visited.

Continue reading

Visualising hierarchical clustering results

I've written about hierarchical clustering before as an attempt to understand it better. Within R, you can plot the hierarchical clustering results however when working with a large dataset you may produce plots like these where all the labels are overlapping:

and

As you can see you can't see any of the labels. During my honours year, I worked in a molecular evolution lab so I had learned about the Newick format, so I was wondering if there was a way to export the hierarchical clustering results into Newick format, so that I could use TreeView or some other tree visualising program to visualise or to manipulate the results. After some searching, I found this post from the Getting Genetics Done blog, which was exactly what I wanted.

So in almost the exact same instructions as the post I linked above:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("ctc")

library(ctc)
#hierarchical clustering of my rows using average linkage
hc <- hclust(d <- as.dist(1 - cor(t(data), method="pearson")), method="average")
write.table(hc2Newick(hc), file="hc.newick")

Now open up hc.newick using your text editor of choice (e.g. Vim), and delete the first line, which should be "x" (with the double quotation marks). On the second line it should start as "1" " (with the double quotation marks), remove those so that the beginning of the line is now an open parenthesis i.e. (. Lastly go to the end of this line and remove the double quotation mark i.e. ", so that the end of the line is a semicolon i.e. ;.

Open your tree viewing program of choice, for example Dendroscope and open the hc.newick file. I prefer this program over TreeView because it allows you to zoom right into the tree (mouse scrolling) and to copy node information to the clipboard (as well as many other features!). Within Dendroscope, you can try different types of layouts, such as a circular dendrogram:

Colouring the branches

I learned of the sparcl package from this Stack Overflow post. Here using the iris dataset, which comes with R, I colour the branches according to the species.

#install sparcl package
install.packages("sparcl")
#load package
library(sparcl)
names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
hh <- hclust(dist(iris[,c(1:4)]))
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50
species_as_numeric = as.numeric(iris$Species)
ColorDendrogram(hh,y=species_as_numeric,branchlength=2)

colour_hierarchical_clustering
The three species, setosa, versicolor and virginica, are mostly clustered together based on their sepal length, sepal width, petal length and petal width.

Phylogenetic profiling

On my wiki I have a short summary of phylogenetic profiling. The program MrBayes is used for Bayesian inference for phylogeny and can be used for inferring relationships using binary type data such as phylogenetic profiles.

The input to MrBayes is a NEXUS file and here is the example I will use:

#NEXUS
begin data;
  dimensions ntax=10 nchar=10;
  format datatype=restriction interleave=no gap=-;
  matrix
  homer 1100000000
  marge 1100000000
  sheldon 0011000000
  amy 0011000000
  lily 0000110000
  marshall 0000110000
  wilma 0000001100
  fred 0000001100
  scooby 0000000011
  shaggy 0000000011
  ;
end;

The input to MrBayes (for more information on the commands refer to the MrBayes manual).


set autoclose=yes nowarn=yes
execute test.nex
mcmc Nchains=8 Ngen=1000000 Temp=0.100000
sump burnin=5000
sumt burnin=5000
quit

MrBayes outputs consensus trees (*.con) in the Newick format that can be visualised using TreeView.

This may be useful for converting large datasets into binary, although losing information, to observe whether any relationships exist.

Clustering mapped reads

Updated 2014 October 8th to include an analysis using CAGE data from ENCODE and rewrote parts of the post.

In this post I will write about a read clustering method called paraclu, which allows mapped reads to be clustered together. This is particularly useful when working with CAGE data, where transcription start sites (TSSs) are known to initiate at different positions but are all providing information on the promoter activity of a transcript, so it is useful to cluster the TSSs together. In addition, paraclu allows different levels of clustering, so you can choose the level that you want. Furthermore, studying the clusters at different levels can reveal subtle properties of promoters; this is akin to adjusting the bin size of histograms to see if certain properties arise.

Continue reading

PCA and rggobi

#generate row names
rname <- c(1:10000)
#generate column names
cname <- c(rep("control",20),rep("test",20))
#initialise array with 10,000 rows and 40 columns
data <- array(0,dim=c(10000,40),dimnames=list(rname,cname))
for (i in 1:nrow(data)){
   x_mean <- sample(1:10000,1)
   x_sd <- sample(1:1000,1)
   y_mean <- sample(1:10000,1)
   y_sd <- sample(1:1000,1)
   x <- rnorm(20,mean=x_mean,sd=x_sd)
   y <- rnorm(20,mean=y_mean,sd=y_sd)
   for (j in 1:20){
      data[i,j] <- x[j]
   }
   for (k in 1:20){
      k_1 <- 20 + k
      data[i,k_1] <- y[k]
   }
}
data.pca <- prcomp(data)
x <- data.pca$rotation[,"PC1"]
y <- data.pca$rotation[,"PC2"]
xy <- data.frame(x,y)
library(rggobi)
g <- ggobi(xy)


I labelled only two samples since the text would overlap. Samples 1 to 20 are located near the 16 and samples 21 to 40 near the 37, as expected.

Finding genes with co-expression patterns

Can the R bioconductor package "WGCNA" find artefactually created modules?

Firstly some (subpar) code to generate an artefactual list of genes with co-expression patterns (modules):

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage $0 <number of samples> <number of rows> <number of modules>\n";
my $number_of_sample = shift or die $usage;
my $number_of_row = shift or die $usage;
my $number_of_module = shift or die $usage;

my @master_pattern = ();

#print header
my $header = "id\t";
foreach my $sample (1 .. $number_of_sample){
   my $sample_name = 'sample' . $sample;
   $header .= "$sample_name\t";
}
$header =~ s/\t$//;
print "$header\n";
#end printing header

my $chunk = sprintf("%.0f",$number_of_row / $number_of_module);
for (my $i = 1; $i <= $number_of_row; $i += $chunk){
   #determine module pattern
   my @pattern = ();
   for (1 .. $number_of_sample){
      #randomly generate 0 or 1
      my $rand = int(rand(2));
      push(@pattern,$rand);
   }
   my $pattern = "@pattern";
   push(@master_pattern,$pattern);
   #go in chunks or modules
   for (my $j = $i; $j < $i + $chunk; ++$j){
      my $p = join('',@pattern);
      my $expression = "${j}_$p\t";
      foreach my $pattern (@pattern){
         my $random_expression = random_expression($pattern);
         $expression .= "$random_expression\t";
      }
      $expression =~ s/\t$//;
      print "$expression\n";
   }
}

warn("Patterns:\n");
foreach my $pattern (@master_pattern){
   warn("$pattern\n");
}

sub random_expression {
   my ($which) = @_;
   my $rand = '';
   if ($which == '0'){
      $rand = '1000';
      while ($rand >= 500){
         $rand = int(rand(1000));
      }
   } else {
      $rand = '0';
      while ($rand <= 500){
         $rand = int(rand(1000));
      }
   }
  return $rand;
}

Running the code:

./generate_random_module.pl 10 1000 20 > 10_sample_1000_list_20_module.tsv

Patterns:
1 0 0 0 0 1 0 0 0 1
0 0 0 0 0 0 1 0 1 1
0 0 1 0 0 0 1 0 0 1
1 0 0 0 0 1 1 1 0 1
0 1 0 0 1 0 0 1 1 0
0 0 0 0 1 1 1 1 0 0
1 1 1 0 0 0 0 0 1 0
1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 0 1
0 0 1 1 1 0 1 1 1 0
0 1 1 0 0 0 0 0 1 1
0 1 0 0 1 0 0 1 1 0
0 0 1 1 0 0 1 0 1 0
0 0 0 0 1 1 1 1 0 1
0 1 0 0 1 1 0 0 0 1
1 0 0 1 1 1 0 1 1 0
0 1 1 1 1 1 1 0 1 0
0 1 0 1 0 0 1 1 1 0
0 1 1 0 1 0 1 0 0 0
0 1 1 1 0 1 1 0 1 1

So there are 20 artefactually created co-expression patterns (1 for over expression and 0 for under expression). Some patterns look very similar.

library("WGCNA")

options(stringsAsFactors=FALSE)

data = read.table("10_sample_1000_list_20_module.tsv",sep="\t",header=TRUE)

test = as.data.frame(t(data[,-c(1)]))

names(test) = rownames(data)

rownames(test) = names(data)[-c(1)]

#gsg = goodSamplesGenes(test,verbose=3)

#gsg$allOK

sampleTree = flashClust(dist(test),method="average")

png (filename="clustering.png", width=2000, height=2000, pointsize=12)
par(cex=0.6)
par(mar=c(0,4,2,0))
plot(sampleTree, main="Sample clustering", sub="", xlab="", cex.lab=1.5,cex.axis=1.5,cex.main=2)
abline(h=40000,col="red")
dev.off ()  ##  Close the image

#powers = c(c(1:10), seq(from=12,to=20,by=2))
#
#sft = pickSoftThreshold(test,powerVector=powers, verbose=5)
#
net = blockwiseModules(test,power=6,minModuleSize=30,reassignThreshold=0, mergeCutHeight=0.25,numericLabels=TRUE,pamRespectsDendro=FALSE,saveTOMs=TRUE,saveTOMFileBase="testTOM",verbose=3)
#
table(net$colors)
#
#sizeGrWindow(12,9)
#
mergedColors = labels2colors(net$colors)
#
png (filename="plot_dendro_and_colors.png", width=2000, height=2000, pointsize=12)
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]], "Module colors", dendroLabels= FALSE, hang=0.03, addGuide = TRUE, guideHang = 0.05)
dev.off()
#
moduleLabels = net$colors
#
moduleColors = labels2colors(net$colors)
#
MEs = net$MEs

geneTree = net$dendrograms[[1]]

save(MEs, moduleLabels, moduleColors, geneTree, net, file = "number_2.RData")

A total of 15 modules are detected:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
108 106 96 94 94 61 56 55 54 54 53 50 46 43 30

Some of the patterns were very similar and were probably merged together, hence 15 out of the 20 artificial modules were detected. Also note that over expression was generated by randomly picking a number between 500 and 1000 and under expression was from 1 - 500. Looks pretty good!