## Building a classification tree in R

In week 6 of the Data Analysis course offered freely on Coursera, there was a lecture on building classification trees (also known as decision trees) in R. I thoroughly enjoyed the lecture and here I reiterate what was taught, both to reinforce my memory and for sharing purposes. I will jump straight into building a…

## K means clustering

Updated: 2014 March 13th. From Wikipedia: k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the…
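To make the "nearest mean" idea concrete, here is a minimal sketch using base R's `kmeans()`. The simulated data and the variable names (`group_a`, `group_b`, `obs`) are assumptions for illustration only and are not taken from the post.

```r
# Simulated data (an assumption for illustration): two well-separated 2-D groups
set.seed(1)
group_a <- matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2)
group_b <- matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2)
obs <- rbind(group_a, group_b)

# Partition the 100 observations into k = 2 clusters;
# nstart = 10 tries several random starts and keeps the best solution
km <- kmeans(obs, centers = 2, nstart = 10)

# Cluster sizes and the cluster means (the "prototypes")
table(km$cluster)
km$centers
```

Each row of `km$centers` is the mean of the observations assigned to that cluster, and `km$cluster` records which cluster each observation belongs to.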

## Random Forests in predicting wines

Updated 2014 September 17th to reflect changes in the R packages. Source: http://mkseo.pe.kr/stats/?p=220. Using Random Forests to predict wines derived from three different cultivars. Download the wine data set from the Machine Learning Repository.

## Comparing different distributions

Updated 2017 September 7th. The Kolmogorov-Smirnov test can be used to test whether two underlying one-dimensional probability distributions differ. As noted in the Wikipedia article: Note that the two-sample test checks whether the two data samples come from the same distribution. This does not specify what that common distribution is (e.g. whether it’s normal or…
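A minimal sketch of the two-sample test with base R's `ks.test()` follows. The simulated samples (`same_1`, `same_2`, `shifted`) are assumptions for illustration and are not the data used in the post.

```r
# Simulated samples (an assumption for illustration)
set.seed(1)
same_1  <- rnorm(200)            # standard normal
same_2  <- rnorm(200)            # drawn from the same distribution
shifted <- rnorm(200, mean = 2)  # same shape, different location

# Two-sample Kolmogorov-Smirnov tests
res_same <- ks.test(same_1, same_2)
res_diff <- ks.test(same_1, shifted)

# D statistic (maximum distance between the two empirical CDFs) and p-value
res_same$statistic
res_same$p.value
res_diff$statistic
res_diff$p.value
```

A small p-value, as expected for `res_diff`, is evidence that the two samples come from different distributions. The test says nothing about which distribution either sample follows.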

## Variance in RNA-Seq data

Updated 2014 April 18th. For this post I will use data from this study, which has already been nicely summarised, to examine the variance in RNA-Seq data. Briefly, the study used LNCaP cells, which are androgen-sensitive human prostate adenocarcinoma cells, and treated the cells with DHT and with a mock treatment as the control. The…

## The Poisson distribution

A Poisson distribution is the probability distribution that results from a Poisson experiment. A probability distribution assigns a probability to possible outcomes of a random experiment. A Poisson experiment has the following properties: The outcomes of the experiment can be classified as either successes or failures. The average number of successes that occurs in a…

## Manual linear regression analysis using R

Updated 2017 September 5th. The aim of linear regression is to find the equation of the straight line that fits the data points the best; the best line is one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is: $$y = mx + b$$ where…
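As a minimal sketch of a manual calculation, the slope and intercept that minimise the sum of squared residuals can be computed from the standard least-squares formulas and checked against `lm()`. The toy data below are made up for illustration and are not from the post.

```r
# Toy data (an assumption for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares slope: sum of cross-deviations over sum of squared x deviations
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# The fitted line passes through (mean(x), mean(y)), which gives the intercept
b <- mean(y) - m * mean(x)

# Compare with R's built-in linear model; coef() returns (intercept, slope)
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b, m))
```

Here `m` and `b` correspond to the slope and intercept in y = mx + b.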

## Step by step Principal Component Analysis using R

I’ve always wondered what goes on behind the scenes of a Principal Component Analysis (PCA). I found this extremely useful tutorial that explains the key concepts of PCA and shows the step by step calculations. Here, I use R to perform each step of a PCA as per the tutorial. Our dataset visualised on the…

## Using R to obtain basic statistics on your dataset

Updated: 2014 June 20th. Most of the data I work with are represented as tables, i.e. with rows and columns. R makes it easy to store (as data frames) and process such data to produce some basic statistics. Here are just some R functions that calculate some basic, but nevertheless useful, statistics. I will use…

## edgeR vs. DESeq using pnas_expression.txt

First, download the file pnas_expression.txt from Davis’s homepage. For more information on the dataset please refer to the edgeR manual and this paper. The latest R version at the time of writing is R 2.13.1. You can download it from here. Install Bioconductor and the required packages: source("http://www.bioconductor.org/biocLite.R") biocLite() biocLite("DESeq") biocLite("edgeR") A filtering criterion of…