Quantile normalisation in R

Updated 14th January 2015 to include a slide from Rafael.

From Wikipedia:

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile normalize two or more distributions to each other, without a reference distribution, sort as before, then set to the average (usually, arithmetical mean) of the distributions. So the highest value in all cases becomes the mean of the highest values, the second highest value becomes the mean of the second highest values, and so on.

Here, I follow the simple example on Wikipedia using R. Firstly, let's create the test dataset:
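As a sketch of where this is heading (the matrix below is illustrative rather than the exact Wikipedia example, and ties are handled naively):

# an illustrative test matrix: 4 observations across 3 samples
mat <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              ncol = 3,
              dimnames = list(c("A", "B", "C", "D"),
                              c("one", "two", "three")))

# step 1: sort each column
# step 2: take the mean across each row of the sorted matrix
row_means <- rowMeans(apply(mat, 2, sort))

# step 3: substitute each value with the mean corresponding to its
# rank within its own column (ties handled naively via ties.method = "min")
mat_norm <- apply(mat, 2, function(x) row_means[rank(x, ties.method = "min")])
mat_norm

In practice, the normalize.quantiles() function from the Bioconductor preprocessCore package performs quantile normalisation, including proper handling of ties.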

Continue reading

Markov chain

A Markov chain is a mathematical system that undergoes transitions from one state to another on a state space in a stochastic (random) manner. Examples of Markov chains include the board game snakes and ladders, where each state represents the position of a player on the board and a player moves between states (different positions on the board) by rolling a die. An important property of Markov chains, called the Markov property, is the memoryless property of the stochastic process: the transition between states depends only on the current state and not on the states preceding it. In terms of the board game, your next position on the board depends only on where you are currently positioned and not on the sequence of moves that got you there. Another way of thinking about it is that the future is independent of the past, given the present.
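To make the Markov property concrete, here is a minimal sketch of simulating a two-state chain in R; the states and transition matrix are illustrative and not from the original post:

# a two-state Markov chain with an illustrative transition matrix
set.seed(31)
states <- c("sunny", "rainy")
P <- matrix(c(0.9, 0.1,   # P(sunny -> sunny), P(sunny -> rainy)
              0.2, 0.8),  # P(rainy -> sunny), P(rainy -> rainy)
            nrow = 2, byrow = TRUE, dimnames = list(states, states))

n <- 10
chain <- character(n)
chain[1] <- "sunny"
for (i in 2:n) {
  # the next state depends only on the current state (the Markov property)
  chain[i] <- sample(states, size = 1, prob = P[chain[i - 1], ])
}
chain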

Continue reading

Probability

The fundamental idea of inferential statistics is determining the probability of obtaining the observed data (or data at least as extreme) when we assume the null hypothesis is true. For example, if we roll a die 10 times and get 10 sixes, what is the probability of observing this result if we assume the null hypothesis that the die is fair? If the die is fair, the probability of getting 10 sixes in 10 rolls is \left(\frac{1}{6}\right)^{10} \approx 1.653817 \times 10^{-8}, which is a very low probability. Since it's extremely unlikely that we would observe 10 sixes on 10 rolls of a fair die by chance, we should reject the null hypothesis. This probability is the p-value.
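We can verify this arithmetic in R:

# probability of rolling ten sixes in ten rolls of a fair die
(1/6)^10
# [1] 1.653817e-08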

Let's consider a less extreme case than the previous example. Here I will use the example from the first lecture of the Statistics for Neuroscience (9506) course, whereby a person (the lecturer) claimed that he had the ability to distinguish two different brands of espresso. Our null hypothesis in this case is that the lecturer doesn't have the ability to distinguish the brands and is simply guessing. We come up with an experiment to test his ability by giving him 8 cups of espresso, where 4 are from brand A and the other 4 are from brand B, and asking him to separate them into two groups. If he manages to correctly group the 8 cups into their respective brands, what is the probability of getting this result if we assume the null hypothesis is true, i.e. what's the probability of getting this result just by chance?
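Since he knows there are 4 cups of each brand, guessing amounts to picking which 4 of the 8 cups belong to brand A; there are \binom{8}{4} = 70 possible groupings and only one is correct. In R:

# number of ways to choose which 4 of the 8 cups are brand A
choose(8, 4)
# [1] 70

# probability of guessing the correct grouping by chance
1 / choose(8, 4)
# [1] 0.01428571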

Continue reading

Set notation

I've just started the Mathematical Biostatistics Boot Camp 1 course and, to help me remember the set notation introduced in the first lecture, I'll include it here:

The sample space, \Omega (upper case omega), is the collection of possible outcomes of an experiment, such as a die roll:

\Omega = \{1, 2, 3, 4, 5, 6\}

An event, say E, is a subset of \Omega, such as the even rolls of the die:

E = \{2, 4, 6\}

An elementary or simple event is a particular result of an experiment, such as rolling a 4 (represented by a lowercase omega):

\omega = 4

A null event or the empty set is represented as \emptyset.
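As an aside, these sets are easy to play around with in R using plain vectors and the built-in set functions (a quick illustration of my own, not from the lecture):

# the sample space of a die roll and the event of even rolls
omega <- 1:6
E <- c(2, 4, 6)

all(E %in% omega)         # E is a subset of omega: TRUE
union(E, c(1, 3, 5))      # 2 4 6 1 3 5, i.e. all of omega
intersect(E, c(1, 3, 5))  # numeric(0), i.e. the empty set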

Continue reading

Predicting cancer

So far I've come across four machine learning methods: random forests, classification trees, hierarchical clustering and k-means clustering. Here I use all four of these methods (plus SVMs) to predict cancer, or more specifically malignant cancers, using the Wisconsin breast cancer dataset.

wget http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
cat breast-cancer-wisconsin.data | head -5
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
wget http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
cat breast-cancer-wisconsin.names | tail -25 | head -16
7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)
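Before trying any of the methods, here's a minimal sketch of loading the data into R and fitting a classification tree with rpart; the column names are my own shorthand for the attribute list above, and na.strings = "?" handles the missing Bare Nuclei values in this dataset:

library(rpart)

# shorthand column names following the attribute list above
col_names <- c("id", "thickness", "size_uniformity", "shape_uniformity",
               "adhesion", "epithelial_size", "bare_nuclei", "chromatin",
               "normal_nucleoli", "mitoses", "class")

bc <- read.csv("breast-cancer-wisconsin.data", header = FALSE,
               col.names = col_names, na.strings = "?")
bc$class <- factor(bc$class, levels = c(2, 4),
                   labels = c("benign", "malignant"))

# fit a classification tree, excluding the sample code number
tree <- rpart(class ~ . - id, data = bc)
tree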

Continue reading

Singular Value Decomposition using R

In linear algebra terms, a Singular Value Decomposition (SVD) is the decomposition of a matrix X into the product of three matrices, each having special properties. If X is a matrix with each variable in a column and each observation in a row, then the SVD is

X = UDV^T

where the columns of U are orthogonal (the left singular vectors), the columns of V are orthogonal (the right singular vectors) and D is a diagonal matrix (of the singular values). Here I perform an SVD on the iris dataset in R.
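As a quick sketch of what this looks like (using svd() on the four numeric columns of iris; centring and scaling the variables first is my own choice here):

# SVD of the numeric columns of the iris dataset
data(iris)
X <- scale(as.matrix(iris[, 1:4]))  # centre and scale each variable
s <- svd(X)

str(s)  # s$u: left singular vectors, s$d: singular values, s$v: right singular vectors

# reconstruct X from the decomposition: X = U D V^T
X_rec <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(X, X_rec, check.attributes = FALSE)
# [1] TRUE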

Continue reading

On curve fitting using R

For linear relationships we can perform a simple linear regression. For other relationships we can try fitting a curve. From Wikipedia:

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints.

I will use the dataset from this question on Stack Overflow.

Using the example dataset:

x <- c(32, 64, 96, 118, 126, 144, 152.5, 158)
y <- c(99.5, 104.8, 108.5, 100, 86, 64, 35.3, 15)
# we will make y the response variable and x the predictor;
# the response variable is usually on the y-axis
plot(x, y, pch = 19)
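One simple way to fit a curve to these data (a sketch; the original post may take a different approach) is polynomial regression using lm() with poly():

# fit a third-degree polynomial to the data
fit <- lm(y ~ poly(x, 3))
summary(fit)

# overlay the fitted curve on the scatter plot
x_new <- seq(min(x), max(x), length.out = 200)
lines(x_new, predict(fit, newdata = data.frame(x = x_new)), col = "red")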

Continue reading