## Rand Index versus the Adjusted Rand Index

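I wrote about the Rand Index (RI) and the Adjusted Rand Index (ARI) in the last two posts, but how do we interpret the indices and how do they differ? The RI is: \$\$!RI = \frac{a + b}{\binom{n}{2}}\$\$ where \$\$a\$\$ and \$\$b\$\$ are the numbers of item pairs that were clustered concordantly in the two clusterings (\$\$a\$\$ for pairs placed together in both, \$\$b\$\$ for pairs placed apart in both) and \$\$n\$\$ is the number of items. I wrote…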

In my last post, I wrote about the Rand index. This post is on the Adjusted Rand index (ARI), which is the corrected-for-chance version of the Rand index: \$\$!ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}\$\$
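As a rough illustration of the chance correction, here is a minimal Python sketch of the ARI in its usual contingency-table form. The function name `adjusted_rand_index` is an assumption made for this example and is not taken from the post; returning 1.0 when the denominator is zero is also an assumption about how to treat that degenerate case.

```python
from collections import Counter
from math import comb  # requires Python 3.8+


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    # Pairs that sit together in both clusterings (contingency table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that sit together within each clustering on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # index expected by chance
    max_index = (sum_a + sum_b) / 2        # largest attainable index
    if max_index == expected:
        # Degenerate case (assumption): treat as perfect agreement.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, different labels
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # groupings that cut across each other
```

Unlike the plain Rand index, this value can fall below zero when two clusterings agree less than would be expected by chance.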

## The Rand index

I’ve been looking for ways to compare clustering results and through my searching I came across something called the Rand index. In this short post, I explain how this index is calculated.
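For a concrete feel of the calculation, the following is a minimal Python sketch that counts agreeing pairs directly. The function name `rand_index` is hypothetical, and the brute-force loop over all pairs is chosen for clarity rather than speed.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    if len(labels_a) < 2:
        raise ValueError("need at least two items to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]  # pair together in clustering A?
        same_b = labels_b[i] == labels_b[j]  # pair together in clustering B?
        if same_a == same_b:                 # together in both, or apart in both
            agree += 1
        total += 1
    return agree / total


print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # label names differ, grouping is identical
```

Only the grouping matters here: the cluster labels themselves can be renamed freely without changing the result.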

## Quantile normalisation in R

Updated 2019 October 11th to explain the index_to_mean function. From Wikipedia: In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile normalize two or more distributions to each other, without a reference distribution, sort as before, then set to the average (usually, arithmetical mean) of the distributions. So…
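The sort-then-average procedure in the quote can be sketched in a few lines of Python. This is an illustration only, not the R code from the post: the name `quantile_normalise` is hypothetical, each inner list is assumed to be one distribution (sample), and tied values are ranked in their order of appearance rather than being averaged.

```python
def quantile_normalise(columns):
    """Quantile normalise equal-length lists, one list per distribution."""
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all distributions must have the same length")
    # Sort each distribution, then average across distributions at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[rank] for col in sorted_cols) / len(columns)
        for rank in range(n_rows)
    ]
    normalised = []
    for col in columns:
        # Indices of this distribution's values from smallest to largest.
        order = sorted(range(n_rows), key=lambda i: col[i])
        out = [0.0] * n_rows
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]  # replace the value by the mean for its rank
        normalised.append(out)
    return normalised


print(quantile_normalise([[5, 2, 3], [4, 1, 6]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalisation every distribution contains the same set of values, just arranged according to each distribution's original ordering.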

## Markov chain

A Markov chain is a mathematical system that undergoes transitions from one state to another on a state space in a stochastic (random) manner. Examples of Markov chains include the board game snakes and ladders, where each state represents the position of a player on the board and a player moves between states (different positions…
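To make the idea of stochastic transitions concrete, here is a small Python sketch that pushes a probability distribution through a transition matrix. The two-state chain and its probabilities are made up for illustration and are not from the post; `step` and `distribution_after` are hypothetical helper names.

```python
def step(distribution, transition):
    """Advance a distribution over states by one transition of the chain."""
    n = len(transition)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]


def distribution_after(start, transition, n_steps):
    """Distribution over states after n_steps transitions from start."""
    dist = list(start)
    for _ in range(n_steps):
        dist = step(dist, transition)
    return dist


# Hypothetical chain: transition[i][j] is the probability of moving from
# state i to state j, so each row sums to 1.
transition = [
    [0.9, 0.1],
    [0.5, 0.5],
]

print(distribution_after([1.0, 0.0], transition, 1))
print(distribution_after([1.0, 0.0], transition, 200))
```

The next state depends only on the current one, which is why a single matrix is enough to describe the whole process. For this particular matrix, repeated steps settle towards a distribution of 5/6 and 1/6 regardless of the starting state.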

## Probability

The fundamental idea of inferential statistics is determining the probability of obtaining the observed data when we assume the null hypothesis is true. For example, if we roll a die 10 times and get 10 sixes, what is the probability of observing this result if we assume the null hypothesis that the die is fair?…
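The die example can be worked through directly. This minimal Python sketch assumes independent rolls of a fair die, so the probability of one particular face on every roll is (1/6) raised to the number of rolls; the function name `prob_same_face_every_roll` is made up for this illustration.

```python
from fractions import Fraction


def prob_same_face_every_roll(n_rolls, sides=6):
    """Probability that a fair die shows one chosen face on every roll."""
    # Independent rolls: multiply the single-roll probability n_rolls times.
    return Fraction(1, sides) ** n_rolls


p = prob_same_face_every_roll(10)
print(p)         # 1/60466176
print(float(p))  # roughly 1.65e-08
```

A probability this small is what would lead us to doubt the null hypothesis that the die is fair.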

## Set notation

I’ve just started the Mathematical Biostatistics Boot Camp 1 and to help me remember the set notations introduced in the first lecture, I’ll include them here: The sample space, \$\$\Omega\$\$ (upper case omega), is the collection of possible outcomes of an experiment, such as a die roll: \$\$!\Omega = \{1, 2, 3, 4, 5, 6\}\$\$…

## Predicting cancer

So far I’ve come across four machine learning methods: random forests, classification trees, hierarchical clustering and k-means clustering. Here I use all four of these methods (plus SVMs) to predict cancer, or more specifically malignant cancers, using the Wisconsin breast cancer dataset.

## Singular Value Decomposition using R

In linear algebra terms, a Singular Value Decomposition (SVD) is the decomposition of a matrix X into three matrices, each having special properties. If X is a matrix with each variable in a column and each observation in a row then the SVD is \$\$!X = UDV^T\$\$ where the columns of U are orthogonal (left…

## On curve fitting using R

For linear relationships we can perform a simple linear regression. For other relationships we can try fitting a curve. From Wikipedia: Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. I will use the dataset from this…