Markov chain

A Markov chain is a mathematical system that undergoes transitions from one state to another on a state space in a stochastic (random) manner. Examples of Markov chains include the board game snakes and ladders, where each state represents the position of a player on the board and a player moves between states (different positions on the board) by rolling a dice. An important property of Markov chains, called the Markov property, is the memoryless property of the stochastic process. Basically what this means is that the transition between states depends only on the current state and not on the states preceding the current state; in terms of the board game, your next position on the board depends only on where you are currently positioned and not on the sequence of moves that got you there. Another way of thinking about it is that the future is independent of the past, given the present.

Continue reading

Tissue specificity

Wikipedia has a definition of entropy with respect to information theory. The introduction of that article gives an example using a coin toss; if a coin toss is fair, the entropy rate for a fair coin toss is one bit per toss. However, if the coin is not fair, then the uncertainty, and hence the entropy rate, is lower.

The formula for the Shannon entropy is:

-\sum_{i=1}^n p(x_{i}) log_{b}p(x_{i})

Let's test this out:

Continue reading

Set notation

I've just started the Mathematical Biostatistics Boot Camp 1 and to help me remember the set notations introduced in the first lecture, I'll include them here:

The sample space, \Omega (upper case omega), is the collection of possible outcomes of an experiment, such as a die roll:

\Omega = \{1, 2, 3, 4, 5, 6\}

An event, say E, is a subset of \Omega, such as the even dice rolls:

E = \{2, 4, 6\}

An elementary or simple event is a particular result of an experiment, such as the roll of 4 (represented as a lowercase omega):

\omega = 4

A null event or the empty set is represented as \emptyset.

Continue reading

Comparing different distributions

I recently learned of the Kolmogorov-Smirnov Test and how one can use it to test whether two datasets are likely to be different, i.e. comparing different distributions. Strictly speaking, the p-value gives us a probability of whether or not we can reject the null hypothesis, which is that two datasets have the same distribution. Using R:

#example 1
x <- rpois(n=1000,lambda=100)
y <- rpois(n=1000,lambda=100)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.036, p-value = 0.5361
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : p-value will be approximate in the presence of ties

#example 2
#how about having a large population
#and drawing different samples for comparison
pop <- rpois(n=1000000, lambda=1000)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    848     979    1000    1000    1021    1165

#sample 1
sample_1 <- sample(x=pop, size=1000)
#add some noise/jitter
sample_1 <- jitter(sample_1, factor=10)

sample_2 <- sample(x=pop, size=1000)
#add some noise/jitter
sample_2 <- jitter(sample_2, factor=10)


	Two-sample Kolmogorov-Smirnov test

data:  sample_1 and sample_2
D = 0.026, p-value = 0.8879
alternative hypothesis: two-sided

In both examples, we cannot reject the null hypothesis that the two distributions are different. Since in the first example they are both from a Poisson distribution with the same lambda and in the second example, the samples were drawn from the same population.

x <- rpois(n=1000,lambda=100)
y <- rnorm(n=1000,mean=100)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.433, p-value < 2.2e-16
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : p-value will be approximate in the presence of ties

In this case we get an extremely low p-value, and we can reject the null, which is that both the distributions are the same and they are not (one is a Normal distribution and the other a Poisson distribution).

The warning messages are due to the implementation of the KS test in R, which expects a continuous distribution and thus there should not be any identical values in the two datasets i.e. ties. I've read several sources and they all mention that the KS test can deal with both discrete and continuous data (I'm guessing because it mainly deals with cumulative quantiles) but I'm not sure about the implementation in R.


Testing for normality

On a related topic, how can we check whether our dataset is normally distributed? One can use use the QQ-normal plot to check if the values lie on a straight diagonal line.

norm_data <- rnorm(n=1000,mean=200,sd=50)


gamma_data <- rgamma(n=1000,shape=1)


What about some statistical test? Let's try the Shapiro-Wilk test, which tests the null hypothesis that a sample came from a normally distributed population.

#we can't reject the null, since it really came from a normally distribution population

	Shapiro-Wilk normality test

data:  norm_data
W = 0.9984, p-value = 0.4765
#we can reject the null, since the values were from a gamma distribution

	Shapiro-Wilk normality test

data:  gamma_data
W = 0.8146, p-value < 2.2e-16
#ks.test reject null that both are the same
ks.test(norm_data, gamma_data)

	Two-sample Kolmogorov-Smirnov test

data:  norm_data and gamma_data
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided

See this useful discussion on StackExchange on testing for normality.

The Poisson distribution

A Poisson distribution is the probability distribution that results from a Poisson experiment. A probability distribution assigns a probability to possible outcomes of a random experiment. A Poisson experiment has the following properties:

  1. The outcomes of the experiment can be classified as either successes or failures.
  2. The average number of successes that occurs in a specified region is known.
  3. The probability that a success will occur is proportional to the size of the region.
  4. The probability that a success will occur in an extremely small region is virtually zero.

Continue reading

Manual linear regression analysis using R

Updated 2014 June 24th

The aim of linear regression is to find the equation of the straight line that fits the data points the best; the best line is one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is:

y = mx + b

where m is the slope or gradient and b is the y-intercept.

Continue reading

Step by step principal components analysis using R

I've always wondered about the calculations behind a principal components analysis (PCA). An extremely useful tutorial explains the key concepts and runs through the analysis. Here I use R to calculate each step of a PCA in hopes of better understanding the analysis.

x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)
#[1] 1.81
#[1] 1.91
x_less_mean <- x - mean(x)
# [1]  0.69 -1.31  0.39  0.09  1.29  0.49  0.19 -0.81 -0.31 -0.71
y_less_mean <- y - mean(y)
# [1]  0.49 -1.21  0.99  0.29  1.09  0.79 -0.31 -0.81 -0.31 -1.01
#[1] 0.6154444
#[1] 0.6165556
#[1] 0.7165556
covar_matrix <- matrix(c(cov(x,x),cov(x,y),cov(y,x),cov(y,y)),nrow=2,ncol=2,byrow=TRUE,dimnames=list(c("x","y"),c("x","y")))
#          x         y
#x 0.6165556 0.6154444
#y 0.6154444 0.7165556
#[1] 1.2840277 0.0490834
#          [,1]       [,2]
#[1,] 0.6778734 -0.7351787
#[2,] 0.7351787  0.6778734
e <- eigen(covar_matrix)
#to get the first principal components
#multiple the first eigenvector to mean transformed x, y values
for (i in 1:length(x_less_mean)){
   pc1 <- (x_less_mean[i] * e$vectors[1,1]) + (y_less_mean[i] * e$vectors[2,1])
#[1] 0.8279702
#[1] -1.77758
#[1] 0.9921975
#[1] 0.2742104
#[1] 1.675801
#[1] 0.9129491
#[1] -0.09910944
#[1] -1.144572
#[1] -0.4380461
#[1] -1.223821
for (i in 1:length(x_less_mean)){
   pc2 <- (x_less_mean[i] * e$vectors[1,2]) + (y_less_mean[i] * e$vectors[2,2])
#[1] -0.1751153
#[1] 0.1428572
#[1] 0.384375
#[1] 0.1304172
#[1] -0.2094985
#[1] 0.1752824
#[1] -0.3498247
#[1] 0.04641726
#[1] 0.01776463
#[1] -0.1626753
z <- array(0, dim=c(10,2))
for (i in 1:length(x_less_mean)){
   pc1 <- (x_less_mean[i] * e$vectors[1,1]) + (y_less_mean[i] * e$vectors[2,1])
   pc2 <- (x_less_mean[i] * e$vectors[1,2]) + (y_less_mean[i] * e$vectors[2,2])
   z[i,1] <- pc1
   z[i,2] <- pc2
#             [,1]        [,2]
# [1,]  0.82797019 -0.17511531
# [2,] -1.77758033  0.14285723
# [3,]  0.99219749  0.38437499
# [4,]  0.27421042  0.13041721
# [5,]  1.67580142 -0.20949846
# [6,]  0.91294910  0.17528244
# [7,] -0.09910944 -0.34982470
# [8,] -1.14457216  0.04641726
# [9,] -0.43804614  0.01776463
#[10,] -1.22382056 -0.16267529
#now do this using the inbuilt prcomp() function
data <- data.frame(x,y)
data.pca <- prcomp(data)
#Standard deviations:
#[1] 1.1331495 0.2215477
#         PC1        PC2
#x -0.6778734  0.7351787
#y -0.7351787 -0.6778734
#              PC1         PC2
# [1,] -0.82797019  0.17511531
# [2,]  1.77758033 -0.14285723
# [3,] -0.99219749 -0.38437499
# [4,] -0.27421042 -0.13041721
# [5,] -1.67580142  0.20949846
# [6,] -0.91294910 -0.17528244
# [7,]  0.09910944  0.34982470
# [8,]  1.14457216 -0.04641726
# [9,]  0.43804614 -0.01776463
#[10,]  1.22382056  0.16267529
#reset par()

I'm still trying to figure out why the signs are inverted between calculating the principal components manually vs. using prcomp(). I guess the relationships don't change but was curious as to why this was the case. However in the tutorial listed at the start of this post, first principal components are identical however the second principal components are inverted.