The Golden Rule of Bioinformatics

I’m a big fan of the book Bioinformatics Data Skills by Vince Buffalo and I highly recommend it to everyone who works in the bioinformatics field. The book introduces the reader to The Golden Rule of Bioinformatics, which is: Never ever trust your tools (or data). I am a strong proponent of this rule, which…

Rand Index versus the Adjusted Rand Index

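I wrote about the Rand Index (RI) and the Adjusted Rand Index (ARI) in the last two posts, but how do we interpret the indices and how are they different? The RI is: $$RI = \frac{a + b}{\binom{n}{2}}$$ where $$a$$ is the number of pairs of items placed in the same cluster in both sets, $$b$$ is the number of pairs placed in different clusters in both sets (so $$a + b$$ counts the concordantly clustered pairs), and $$n$$ is the number of items. I wrote…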
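
As a rough illustration of the RI calculation, here is a minimal Python sketch. It assumes each clustering is given as a list of cluster labels, one per item, and it counts the concordant pairs (pairs placed together in both clusterings plus pairs placed apart in both) over all $$\binom{n}{2}$$ pairs. The function name `rand_index` and the example labels are made up for illustration and are not taken from the original posts.

```python
from itertools import combinations


def rand_index(labels_x, labels_y):
    """Rand index between two clusterings given as per-item label lists."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_x)
    if n < 2:
        raise ValueError("at least two items are needed to form a pair")

    a = 0  # pairs in the same cluster in both clusterings
    b = 0  # pairs in different clusters in both clusterings
    for i, j in combinations(range(n), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1
        elif not same_x and not same_y:
            b += 1

    total_pairs = n * (n - 1) // 2  # n choose 2
    return (a + b) / total_pairs


if __name__ == "__main__":
    # Hypothetical example: four items clustered two different ways.
    print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

Only whether two items share a label matters, not the label values themselves, so relabelling the clusters leaves the index unchanged. The pairwise loop is quadratic in the number of items, which is fine for a small example but slow for large datasets.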

The Rand index

I’ve been looking for ways to compare clustering results and through my searching I came across something called the Rand index. In this short post, I explain how this index is calculated.

Markov clustering

The Markov Cluster (MCL) Algorithm is an unsupervised cluster algorithm for graphs based on simulation of stochastic flow in graphs. Markov clustering was the work of Stijn van Dongen and you can read his thesis on the Markov Cluster Algorithm. The work is based on the graph clustering paradigm, which postulates that natural groups in…

Visualising hierarchical clustering results

I’ve written about hierarchical clustering before as an attempt to understand it better. Within R, you can plot hierarchical clustering results; however, when working with a large dataset you may produce plots where all the labels overlap and none of them can be read. During my…

Phylogenetic profiling

On my wiki I have a short summary of phylogenetic profiling. The program MrBayes is used for Bayesian inference for phylogeny and can be used for inferring relationships using binary type data such as phylogenetic profiles. The input to MrBayes is a NEXUS file and here is the example I will use: #NEXUS begin data;…

Clustering mapped reads

Updated 2014 October 8th: added an analysis using CAGE data from ENCODE and rewrote parts of the post. In this post I will write about a read clustering method called paraclu, which allows mapped reads to be clustered together. This is particularly useful when working with CAGE data, where transcription start sites (TSSs) are…

Finding genes with co-expression patterns

Can the R Bioconductor package “WGCNA” find artefactually created modules? Firstly, some (subpar) code to generate an artefactual list of genes with co-expression patterns (modules): Running the code: ./generate_random_module.pl 10 1000 20 > 10_sample_1000_list_20_module.tsv Patterns: 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1…
