Visualising hierarchical clustering results

I've written about hierarchical clustering before as an attempt to understand it better. Within R you can plot hierarchical clustering results; however, with a large dataset you may end up with plots like these, where all the labels overlap:

As you can see, none of the labels are legible. During my honours year I worked in a molecular evolution lab, where I learned about the Newick format, so I wondered whether there was a way to export hierarchical clustering results in Newick format and then use TreeView or another tree visualisation program to view or manipulate them. After some searching, I found this post on the Getting Genetics Done blog, which was exactly what I wanted.

The steps below follow the post I linked above almost exactly:

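#install if necessary (biocLite() has been retired; BiocManager is the current Bioconductor installer)
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
BiocManager::install("ctc")

library(ctc)
#correlation distance between rows: 1 - Pearson correlation (cor() works on columns, hence t())
d <- as.dist(1 - cor(t(data), method="pearson"))
#hierarchical clustering of my rows using average linkage
hc <- hclust(d, method="average")
write.table(hc2Newick(hc), file="hc.newick")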
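
To make the distance step more concrete, here is a small sketch on a made-up matrix. The object `toy` and its row names are my own invention and simply stand in for `data`; only base R is used. It shows that a perfectly correlated pair of rows ends up at distance 0 and a perfectly anti-correlated pair at distance 2.

```r
# Toy matrix standing in for `data`: rows are the items to cluster
set.seed(1)
toy <- matrix(rnorm(40), nrow = 5,
              dimnames = list(paste0("row", 1:5), NULL))
toy["row2", ] <- toy["row1", ] * 2 + 1  # perfectly correlated with row1
toy["row3", ] <- -toy["row1", ]         # perfectly anti-correlated with row1

# cor() correlates columns, so transpose to get row-by-row correlations
d <- as.dist(1 - cor(t(toy), method = "pearson"))

# Distances from row1: expected to be 0 for row2 and 2 for row3
round(as.matrix(d)["row1", c("row2", "row3")], 3)

# Average-linkage clustering on the correlation distance
hc_toy <- hclust(d, method = "average")
```

Because the distance depends only on correlation, rows with the same shape but different scale are treated as identical, which may or may not be what you want for your own data.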

Now open hc.newick in your text editor of choice (e.g. Vim) and delete the first line, which should be "x" (including the double quotation marks). The second line should start with "1" " (again including the double quotation marks); remove those characters so that the line begins with an open parenthesis, i.e. (. Lastly, go to the end of this line and remove the trailing double quotation mark, i.e. ", so that the line ends with a semicolon, i.e. ;.
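
The manual clean-up can probably be avoided altogether by writing the string with writeLines() instead of write.table(). The sketch below is only an illustration: the Newick string is made up and stands in for the output of hc2Newick(hc), and the temporary file names are my own choice.

```r
# Made-up Newick string standing in for hc2Newick(hc)
newick <- "((a:0.1,b:0.1):0.2,c:0.3);"

table_file <- file.path(tempdir(), "hc_table.newick")
plain_file <- file.path(tempdir(), "hc.newick")

# write.table() adds a header line, a row name and double quotes
write.table(newick, file = table_file)
readLines(table_file)

# writeLines() writes the string as-is, so no hand editing is needed
writeLines(newick, con = plain_file)
readLines(plain_file)
```

With the real data, the equivalent call would be writeLines(hc2Newick(hc), con="hc.newick"), assuming hc2Newick() returns a single character string as it does above.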

Open the hc.newick file in your tree viewing program of choice, for example Dendroscope. I prefer Dendroscope over TreeView because it lets you zoom right into the tree (by scrolling with the mouse) and copy node information to the clipboard, among many other features. Within Dendroscope you can try different types of layouts, such as a circular dendrogram:

Colouring the branches

I learned of the sparcl package from this Stack Overflow post. Here, using the iris dataset that comes with R, I colour the branches according to species.

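#install sparcl package
install.packages("sparcl")
#load package
library(sparcl)
names(iris)
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
#cluster on the four measurement columns (Euclidean distance, complete linkage by default)
hh <- hclust(dist(iris[, 1:4]))
table(iris$Species)
#
#     setosa versicolor  virginica
#         50         50         50
#convert the species factor to integer codes (1, 2, 3), which set the branch colours
species_as_numeric <- as.numeric(iris$Species)
ColorDendrogram(hh, y=species_as_numeric, branchlength=2)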

The three species, setosa, versicolor and virginica, are mostly clustered together based on their sepal length, sepal width, petal length and petal width.
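
One way to put a number on "mostly" is to cut the tree into groups and cross-tabulate them against the species. This is a rough sketch using base R only; cutting into three groups is my assumption (chosen to match the three species), and the purity score at the end is an informal summary rather than a formal validation measure.

```r
# Same clustering as above: Euclidean distance, hclust() default (complete) linkage
hh <- hclust(dist(iris[, 1:4]))

# Cut the tree into three groups (assumption: one group per species)
clusters <- cutree(hh, k = 3)

# Rows are cluster labels, columns are species
agreement <- table(cluster = clusters, species = iris$Species)
print(agreement)

# Fraction of flowers that belong to the majority species of their cluster
purity <- sum(apply(agreement, 1, max)) / sum(agreement)
purity
```

A cluster row dominated by a single species indicates clean separation; rows split between two species show where the four measurements alone do not separate them.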

This work is licensed under a Creative Commons Attribution 4.0 International License.
