## Markov chain

A Markov chain is a mathematical system that undergoes transitions from one state to another on a state space in a stochastic (random) manner. Examples of Markov chains include the board game snakes and ladders, where each state represents the position of a player on the board and a player moves between states (different positions…
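
To make the snakes and ladders example concrete, here is a minimal Python sketch. The ten-square board, the single ladder and snake in `JUMPS`, and the rule that overshooting the last square still finishes the game are all assumptions made for illustration; they do not describe a real board. Each row of the transition matrix gives the probability of moving from one square to every other square on a single die roll.

```python
from fractions import Fraction

# Hypothetical mini board: squares 0..9, square 9 is the finish.
N_SQUARES = 10
JUMPS = {2: 7, 8: 3}  # assumed ladder (2 -> 7) and snake (8 -> 3)


def transition_matrix(n_squares=N_SQUARES, jumps=JUMPS, die_sides=6):
    """Row i holds the probabilities of moving from square i to each square."""
    last = n_squares - 1
    P = [[Fraction(0)] * n_squares for _ in range(n_squares)]
    for i in range(n_squares):
        if i == last:
            P[i][i] = Fraction(1)  # the finish square is absorbing
            continue
        for roll in range(1, die_sides + 1):
            j = min(i + roll, last)  # overshooting still finishes (assumed rule)
            j = jumps.get(j, j)      # follow a snake or ladder if we land on one
            P[i][j] += Fraction(1, die_sides)
    return P


def step(dist, P):
    """Advance a probability distribution over squares by one turn."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


P = transition_matrix()
dist = [Fraction(1)] + [Fraction(0)] * (N_SQUARES - 1)  # start on square 0
for _ in range(3):
    dist = step(dist, P)

print(float(dist[-1]))  # probability of having finished within three turns
```

The next square depends only on the current square and the die roll, not on how the player got there, which is the property that makes the game a Markov chain.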

## Tissue specificity

Updated 2017 October 14th. A key measure in information theory is entropy, which is the amount of uncertainty involved in a random process; the lower the uncertainty, the lower the entropy. For example, there is lower entropy in a fair coin flip versus a fair die roll since there are more possible outcomes with a…
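
The coin versus die comparison can be checked with a few lines of Python. This is a minimal sketch using Shannon entropy in bits (log base 2); the function name `shannon_entropy` is my own and is only for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


coin = [1 / 2] * 2  # fair coin: two equally likely outcomes
die = [1 / 6] * 6   # fair die: six equally likely outcomes

print(shannon_entropy(coin))
print(shannon_entropy(die))
```

For a uniform distribution over n outcomes the entropy is log2(n), so the fair coin gives 1 bit and the fair die gives log2(6) bits.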

## Set notation

I’ve just started the Mathematical Biostatistics Boot Camp 1 and, to help me remember the set notation introduced in the first lecture, I’ll include it here. The sample space, $$\Omega$$ (upper case omega), is the collection of possible outcomes of an experiment, such as a die roll: $$\Omega = \{1, 2, 3, 4, 5, 6\}$$…
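
The die roll sample space maps directly onto Python's built-in `set` type. In this small sketch the two events `even` and `at_most_three` are hypothetical examples chosen for illustration; they are not taken from the lecture.

```python
# Sample space of a single die roll
omega = {1, 2, 3, 4, 5, 6}

# Two hypothetical events, each a subset of the sample space
even = {2, 4, 6}
at_most_three = {1, 2, 3}

print(even <= omega)         # subset test: is every outcome of `even` in omega?
print(even | at_most_three)  # union: outcomes in either event
print(even & at_most_three)  # intersection: outcomes in both events
print(omega - even)          # complement of `even` relative to omega
```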

## Comparing different distributions

Updated 2017 September 7th. The Kolmogorov-Smirnov test can be used to test whether two underlying one-dimensional probability distributions differ. As noted in the Wikipedia article: Note that the two-sample test checks whether the two data samples come from the same distribution. This does not specify what that common distribution is (e.g. whether it’s normal or…
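
The two-sample test is built on a single number: the largest vertical gap between the two empirical cumulative distribution functions. Here is a minimal Python sketch of that statistic only; it does not compute a p-value, and the function name `ks_statistic` is my own.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A statistic of 0 means the two empirical distributions coincide, and 1 means the samples do not overlap at all.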

## The Poisson distribution

A Poisson distribution is the probability distribution that results from a Poisson experiment. A probability distribution assigns a probability to possible outcomes of a random experiment. A Poisson experiment has the following properties: The outcomes of the experiment can be classified as either successes or failures. The average number of successes that occurs in a…
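
For reference, the Poisson probability of observing exactly k successes when the average number of successes is λ is P(X = k) = λ^k e^(−λ) / k!. Below is a minimal Python sketch of that formula; the value `lam = 2.0` is an arbitrary choice for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k successes when the mean number is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 2.0  # assumed average number of successes, for illustration only
for k in range(5):
    print(k, poisson_pmf(k, lam))
```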

## Manual linear regression analysis using R

Updated 2017 September 5th. The aim of linear regression is to find the equation of the straight line that fits the data points the best; the best line is one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is $$y = mx + b$$, where $$m$$ is the slope or gradient…
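
The line that minimises the sum of squared residuals has a closed-form solution: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the line passes through the point of means. Here is a minimal Python sketch of that calculation (the post itself works in R); the function names and the five data points are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for paired observations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept


def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))


# Hypothetical data, roughly following a straight line
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
print(slope, intercept)
print(sum_squared_residuals(x, y, slope, intercept))
```

Nudging either the slope or the intercept away from the fitted values can only increase the sum of squared residuals, which is a quick way to sanity-check the result.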

## Step by step Principal Component Analysis using R

I’ve always wondered what goes on behind the scenes of a Principal Component Analysis (PCA). I found this extremely useful tutorial that explains the key concepts of PCA and shows the step by step calculations. Here, I use R to perform each step of a PCA as per the tutorial. Our dataset visualised on the…
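
As a rough outline of what such a step by step calculation involves, here is a minimal Python sketch for two-dimensional data. It follows the usual sequence of centring the data, building the covariance matrix, finding its eigenvalues and leading eigenvector, and projecting the data. Whether this matches the tutorial's steps exactly is an assumption, and the function name `pca_2d` and the three data points are made up.

```python
import math


def pca_2d(xs, ys):
    """PCA of 2D data: eigenvalues, first principal axis and PC1 scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Step 1: centre each variable on its mean
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Step 2: sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum(d * d for d in dx) / (n - 1)
    syy = sum(d * d for d in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    # Step 3: eigenvalues of a symmetric 2x2 matrix (closed form)
    centre = (sxx + syy) / 2
    half_gap = math.hypot((sxx - syy) / 2, sxy)
    l1, l2 = centre + half_gap, centre - half_gap
    # Step 4: unit eigenvector belonging to the larger eigenvalue
    if sxy != 0:
        v = (sxy, l1 - sxx)
    else:  # variables are uncorrelated, so the axes are the eigenvectors
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    pc1 = (v[0] / norm, v[1] / norm)
    # Step 5: project the centred data onto the first principal axis
    scores = [a * pc1[0] + b * pc1[1] for a, b in zip(dx, dy)]
    return (l1, l2), pc1, scores


eigenvalues, pc1, scores = pca_2d([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(eigenvalues, pc1, scores)
```

The variance of the projected scores equals the larger eigenvalue, which is why the eigenvalues are read as the variance explained by each component.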

## Creating a correlation matrix with R

Updated 2014 January 6th. This post on creating a correlation matrix with R was first published on 2012 January 31st and has become one of the most viewed posts. I’ve learned a bit more since then, so I have updated and improved this post. Incentive: Let $$A$$ be an $$m \times n$$ matrix, where…
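
To show the shape of the result, here is a minimal Python sketch (the post itself uses R). It assumes the quantity of interest is the Pearson correlation between every pair of columns of the matrix, which gives a square, symmetric matrix with ones on the diagonal. The 4 × 3 matrix `A` below is made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(rows):
    """Pairwise Pearson correlations between the columns of a row-major matrix."""
    cols = list(zip(*rows))  # transpose rows into columns
    return [[pearson(ci, cj) for cj in cols] for ci in cols]


# Hypothetical 4 x 3 matrix: column 2 is twice column 1, column 3 decreases
A = [
    [1.0, 2.0, 9.0],
    [2.0, 4.0, 7.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, 1.0],
]

C = correlation_matrix(A)
for row in C:
    print(row)
```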

## Using R to obtain basic statistics on your dataset

Updated 2014 June 20th. Most of the data I work with are represented as tables, i.e. with rows and columns. R makes it easy to store (as data frames) and process such data to produce some basic statistics. Here are just some R functions that calculate some basic, but nevertheless useful, statistics. I will use…

## Pearson vs. Spearman correlation

Correlation measures are commonly used to show how correlated two datasets are. A commonly used measure is the Pearson correlation. To illustrate when not to use a Pearson correlation: If we remove the 2,000 value: Use a non-parametric correlation measure (e.g. Spearman’s rank correlation) if your dataset has outliers. It would probably be best…
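
The effect of a single outlier can be reproduced with a small Python sketch. The data below are made up for illustration (only the idea of a lone 2,000 value comes from the post), and the simple `ranks` helper assumes there are no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data with one extreme value at the end
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 2000]

print(pearson(x, y))            # pulled down by the outlier
print(spearman(x, y))           # ranks barely notice the outlier
print(pearson(x[:-1], y[:-1]))  # Pearson once the 2,000 value is removed
```

Because Spearman's measure only looks at ranks, the 2,000 value counts as "the largest" and nothing more, so its size has no influence on the result.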