Building a classification tree in R

In week 6 of the Data Analysis course offered freely on Coursera, there was a lecture on building classification trees (also known as decision trees) in R. I thoroughly enjoyed the lecture and here I reiterate what was taught, both to reinforce my memory and for sharing purposes. I will jump straight into building a…

K means clustering

Updated: 2014 March 13th. From Wikipedia: k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the…
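As a minimal sketch of the definition quoted above, the snippet below runs base R's kmeans() on simulated two-dimensional data. The data, the seed, and the choice of k = 2 are assumptions made purely for illustration and are not taken from the post.

```r
# Simulate two well-separated groups of points (illustrative data only)
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# Partition the 100 observations into k = 2 clusters;
# nstart = 10 repeats the algorithm from 10 random starts and keeps the best fit
fit <- kmeans(x, centers = 2, nstart = 10)

# Each observation is assigned to the cluster with the nearest mean
fit$centers         # the two cluster means (the "prototypes")
table(fit$cluster)  # number of observations per cluster
```

The cluster labels themselves (1 or 2) are arbitrary, so only the grouping is meaningful, not which group gets which number.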

Random Forests in predicting wines

Updated 2014 September 17th to reflect changes in the R packages. Source: http://mkseo.pe.kr/stats/?p=220. Using Random Forests to predict wines derived from three different cultivars. Download the wine data set from the Machine Learning Repository.

Comparing different distributions

Updated 2017 September 7th. The Kolmogorov-Smirnov test can be used to test whether two underlying one-dimensional probability distributions differ. As noted in the Wikipedia article: Note that the two-sample test checks whether the two data samples come from the same distribution. This does not specify what that common distribution is (e.g. whether it’s normal or…
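A small sketch of the two-sample test described above, using ks.test() from base R's stats package. The simulated samples, the seed, and the variable names are assumptions for illustration only.

```r
set.seed(1)
x <- rnorm(100)            # sample from a standard normal
y <- rnorm(100)            # second sample from the same distribution
z <- rnorm(100, mean = 2)  # sample from a normal with a shifted mean

# Two-sample Kolmogorov-Smirnov tests
same   <- ks.test(x, y)  # both samples drawn from the same distribution
differ <- ks.test(x, z)  # samples drawn from different distributions

same$p.value
differ$p.value  # a small p-value suggests the two distributions differ
```

Note that a large p-value only means the test found no evidence of a difference; it says nothing about what the common distribution is.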

Variance in RNA-Seq data

Updated 2014 April 18th. For this post I will use data from this study, which has already been nicely summarised, to examine the variance in RNA-Seq data. Briefly, the study used LNCaP cells, which are androgen-sensitive human prostate adenocarcinoma cells, and treated the cells with DHT and with a mock treatment as the control. The…

The Poisson distribution

A Poisson distribution is the probability distribution that results from a Poisson experiment. A probability distribution assigns a probability to possible outcomes of a random experiment. A Poisson experiment has the following properties: The outcomes of the experiment can be classified as either successes or failures. The average number of successes that occurs in a…
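To make the idea of a probability distribution over outcomes concrete, here is a rough sketch using base R's dpois(). The rate lambda = 3 and the range of counts are assumed values chosen only for illustration; the manual calculation uses the standard Poisson probability mass function, P(X = k) = exp(-lambda) * lambda^k / k!.

```r
lambda <- 3  # assumed average number of successes per interval
k <- 0:10    # numbers of successes to evaluate

# Probability of observing exactly k successes
p <- dpois(k, lambda)

# The same probabilities computed directly from the Poisson formula
manual <- exp(-lambda) * lambda^k / factorial(k)

all.equal(p, manual)  # checks that the two calculations agree
```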

Manual linear regression analysis using R

Updated 2017 September 5th. The aim of linear regression is to find the equation of the straight line that fits the data points the best; the best line is one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is: $$y = mx + b$$ where…
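As a minimal sketch of the idea above, the slope m and intercept b that minimise the sum of squared residuals can be computed by hand and compared with lm(). The simulated data and the seed are assumptions for illustration and are not the data used in the post.

```r
# Simulated data scattered around the line y = 2x + 1 (illustrative only)
set.seed(1)
x <- 1:20
y <- 2 * x + 1 + rnorm(20)

# Least-squares estimates of the slope (m) and intercept (b)
m <- cov(x, y) / var(x)
b <- mean(y) - m * mean(x)

# The same estimates from R's built-in linear model function
fit <- lm(y ~ x)
coef(fit)  # intercept first, then slope
c(b, m)

# The quantity being minimised: the sum of squared residuals
sum((y - (m * x + b))^2)
```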

Step by step Principal Component Analysis using R

I’ve always wondered what goes on behind the scenes of a Principal Component Analysis (PCA). I found this extremely useful tutorial that explains the key concepts of PCA and shows the step by step calculations. Here, I use R to perform each step of a PCA as per the tutorial. Our dataset visualised on the…

Using R to obtain basic statistics on your dataset

Updated: 2014 June 20th. Most of the data I work with are represented as tables, i.e. with rows and columns. R makes it easy to store (as data frames) and process such data to produce some basic statistics. Here are just some R functions that calculate some basic, but nevertheless useful, statistics. I will use…
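A short sketch of the kind of basic statistics meant here, using only base R functions. The built-in iris data set is an assumption standing in for whatever data the post uses.

```r
# Use the four numeric columns of the built-in iris data set (assumed example data)
df <- iris[, 1:4]

summary(df)           # min, quartiles, median, mean, and max per column
col_means <- colMeans(df)       # mean of each column
col_sds   <- sapply(df, sd)     # standard deviation of each column
col_range <- apply(df, 2, range) # minimum and maximum of each column

col_means
col_sds
col_range
```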

edgeR vs. DESeq using pnas_expression.txt

First, download the file pnas_expression.txt from Davis’s homepage. For more information on the dataset please refer to the edgeR manual and this paper. The latest R version at the time of writing is R 2.13.1. You can download it from here. Install Bioconductor and the required packages: source("http://www.bioconductor.org/biocLite.R"); biocLite(); biocLite("DESeq"); biocLite("edgeR"). A filtering criterion of…
