Rand Index versus the Adjusted Rand Index

I wrote about the Rand Index (RI) and the Adjusted Rand Index (ARI) in the last two posts, but how do we interpret the indices and how do they differ? The RI is $$RI = \frac{a + b}{\binom{n}{2}}$$, where $$a$$ and $$b$$ count the pairs of items that were clustered concordantly in two different sets ($$a$$: pairs placed together in both; $$b$$: pairs placed apart in both) and $$n$$ is the number of items. I wrote…

The Rand index

I’ve been looking for ways to compare clustering results and through my searching I came across something called the Rand index. In this short post, I explain how this index is calculated.
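
As a rough illustration of the calculation, here is a minimal sketch in Python (not the code from the post itself). It assumes each clustering is given as a list of cluster labels, one per item, and the function name `rand_index` is my own choice. It counts the pairs of items on which the two clusterings agree and divides by the total number of pairs.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand index between two clusterings of the same items.

    labels_a and labels_b are sequences of cluster labels, one per item
    (an assumed input format for this sketch).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")

    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items to form a pair")

    agreements = 0
    for i, j in pairs:
        together_in_a = labels_a[i] == labels_a[j]
        together_in_b = labels_b[i] == labels_b[j]
        # A pair is concordant when both clusterings put it together,
        # or both clusterings keep it apart.
        if together_in_a == together_in_b:
            agreements += 1

    return agreements / len(pairs)


# Identical partitions with different label names agree on every pair.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

The pairwise loop is quadratic in the number of items, which is fine for small examples but slow for large data sets.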

Quantile normalisation in R

Updated 2019 October 11th to explain the index_to_mean function. From Wikipedia: In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile normalize two or more distributions to each other, without a reference distribution, sort as before, then set to the average (usually, arithmetical mean) of the distributions. So…
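
The procedure in the Wikipedia description can be sketched as follows. This is a minimal Python sketch, not the R code from the post, and it is not a reimplementation of `index_to_mean`. It assumes the data arrive as a list of equal-length columns, and it breaks ties by position instead of averaging tied ranks. The name `quantile_normalise` is hypothetical.

```python
def quantile_normalise(columns):
    """Quantile normalise a list of equal-length numeric columns.

    Simplification for this sketch: tied values within a column are
    ranked in order of appearance, not given an averaged rank.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")

    # Sort each column, then average across columns at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[rank] for col in sorted_cols) / len(columns)
        for rank in range(n)
    ]

    normalised = []
    for col in columns:
        # order[rank] is the index of the value holding that rank.
        order = sorted(range(n), key=lambda i: col[i])
        new_col = [0.0] * n
        for rank, idx in enumerate(order):
            new_col[idx] = rank_means[rank]
        normalised.append(new_col)
    return normalised


print(quantile_normalise([[2, 4, 6], [30, 10, 20]]))
# [[6.0, 12.0, 18.0], [18.0, 6.0, 12.0]]
```

After normalisation every column contains the same set of values, only arranged according to each column's original ordering.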

Markov chain

A Markov chain is a mathematical system that undergoes transitions from one state to another on a state space in a stochastic (random) manner. Examples of Markov chains include the board game snakes and ladders, where each state represents the position of a player on the board and a player moves between states (different positions…

Probability

The fundamental idea of inferential statistics is determining the probability of obtaining the observed data when we assume the null hypothesis is true. For example, if we roll a die 10 times and get 10 sixes, what is the probability of observing this result if we assume the null hypothesis that the die is fair?…
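
For the die example, the arithmetic can be checked with a short Python sketch. It assumes the rolls are independent and that a fair die shows each face with probability 1/6. The function name `prob_all_sixes` is my own.

```python
from fractions import Fraction


def prob_all_sixes(n_rolls, sides=6):
    """Probability that every one of n_rolls independent rolls of a
    fair die with the given number of sides shows a six."""
    return Fraction(1, sides) ** n_rolls


p = prob_all_sixes(10)
print(p)         # 1/60466176
print(float(p))  # the same probability as a floating-point number
```

Using `Fraction` keeps the result exact, so the denominator 6^10 = 60466176 is visible directly.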

Set notation

I’ve just started the Mathematical Biostatistics Boot Camp 1 and to help me remember the set notations introduced in the first lecture, I’ll include them here: The sample space, $$\Omega$$ (upper case omega), is the collection of possible outcomes of an experiment, such as a die roll: $$!\Omega = \{1, 2, 3, 4, 5, 6\}$$…

Predicting cancer

So far I’ve come across four machine learning methods: random forests, classification trees, hierarchical clustering and k-means clustering. Here I use all four of these methods (plus SVMs) to predict cancer, or more specifically malignant cancers, using the Wisconsin breast cancer dataset.

Singular Value Decomposition using R

In linear algebra terms, a Singular Value Decomposition (SVD) is the decomposition of a matrix X into three matrices, each having special properties. If X is a matrix with each variable in a column and each observation in a row then the SVD is $$!X = UDV^T$$ where the columns of U are orthogonal (left…

On curve fitting using R

For linear relationships we can perform a simple linear regression. For other relationships we can try fitting a curve. From Wikipedia: Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. I will use the dataset from this…
