A Markov chain is a mathematical system that undergoes transitions from one state to another on a state space in a stochastic (random) manner. Examples of Markov chains include the board game snakes and ladders, where each state represents the position of a player on the board and a player moves between states (different positions on the board) by rolling a die. An important property of Markov chains, called the Markov property, is the memoryless property of the stochastic process. Basically, this means that the transition between states depends only on the current state and not on the states preceding it; in terms of the board game, your next position on the board depends only on where you are currently positioned and not on the sequence of moves that got you there. Another way of thinking about it is that **the future is independent of the past, given the present**.
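The memoryless dynamics can be sketched with a toy transition matrix; the three states and probabilities below are made up purely for illustration:

```r
# toy Markov chain: three hypothetical states with made-up transition probabilities
# each row gives P(next state | current state) and must sum to 1
P <- matrix(c(0.5, 0.3, 0.2,
              0.1, 0.6, 0.3,
              0.2, 0.2, 0.6),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# simulate the chain: each step depends only on the current state
set.seed(123)
n_steps <- 10
chain <- character(n_steps)
chain[1] <- "A"
for (i in 2:n_steps) {
  chain[i] <- sample(colnames(P), size = 1, prob = P[chain[i - 1], ])
}
chain
```

Note that the `sample()` call only ever looks at `chain[i - 1]`, which is exactly the Markov property.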

# Tag Archives: statistics

# Tissue specificity

Wikipedia has a definition of entropy with respect to information theory. The introduction of that article gives an example using a coin toss: if the coin is fair, the entropy rate is one bit per toss; if the coin is biased, the uncertainty, and hence the entropy rate, is lower.

The formula for the Shannon entropy is:

H(X) = -Σ p(x) log2 p(x)

where the sum is over all possible outcomes x and p(x) is the probability of outcome x.

Let's test this out:
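A minimal check in R; the `entropy` function here is written from scratch for illustration:

```r
# Shannon entropy in bits: H = -sum(p * log2(p))
# zero probabilities are dropped, since 0 * log2(0) is taken to be 0
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))   # fair coin: 1 bit per toss
entropy(c(0.9, 0.1))   # biased coin: lower entropy
entropy(c(1, 0))       # certain outcome: no uncertainty, 0 bits
```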

# Set notation

I've just started the Mathematical Biostatistics Boot Camp 1 and, to help me remember the set notation introduced in the first lecture, I'll include it here:

The **sample space**, Ω (upper case omega), is the collection of possible outcomes of an experiment, such as a die roll:

Ω = {1, 2, 3, 4, 5, 6}

An **event**, say E, is a subset of Ω, such as the even die rolls:

E = {2, 4, 6}

An **elementary** or **simple** event is a particular result of an experiment, such as the roll of a 4 (represented as a lowercase omega):

ω = 4

A **null event** or the **empty set** is represented as ∅.
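These sets can be played with directly in R using the built-in set functions:

```r
# the sample space of a die roll and the event of even rolls
omega <- 1:6
E <- c(2, 4, 6)

intersect(E, omega)    # E is a subset of omega, so this is just E
setdiff(omega, E)      # the complement of E: the odd rolls
union(E, setdiff(omega, E))  # E and its complement make up the sample space
4 %in% E               # is the simple event 4 in E?
```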

# Comparing different distributions

I recently learned of the Kolmogorov-Smirnov test and how one can use it to check whether two datasets are likely to come from different distributions. The null hypothesis is that the two datasets are drawn from the same distribution; the p-value is the probability of observing a difference between the two empirical distributions at least as large as the one seen, assuming the null is true, so a small p-value lets us reject the null. Using R:

```r
#example 1
set.seed(123)
x <- rpois(n=1000, lambda=100)
set.seed(234)
y <- rpois(n=1000, lambda=100)
ks.test(x, y)

        Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.036, p-value = 0.5361
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : p-value will be approximate in the presence of ties

#example 2
#how about having a large population
#and drawing different samples for comparison
set.seed(123)
pop <- rpois(n=1000000, lambda=1000)
summary(pop)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    848     979    1000    1000    1021    1165

#sample 1
set.seed(123)
sample_1 <- sample(x=pop, size=1000)
#add some noise/jitter
sample_1 <- jitter(sample_1, factor=10)

#sample 2
set.seed(234)
sample_2 <- sample(x=pop, size=1000)
#add some noise/jitter
sample_2 <- jitter(sample_2, factor=10)

ks.test(sample_1, sample_2)

        Two-sample Kolmogorov-Smirnov test

data:  sample_1 and sample_2
D = 0.026, p-value = 0.8879
alternative hypothesis: two-sided
```

In both examples, we cannot reject the null hypothesis that the two samples come from the same distribution. This is expected: in the first example both samples come from a Poisson distribution with the same lambda, and in the second example both samples were drawn from the same population.

```r
set.seed(12345)
x <- rpois(n=1000, lambda=100)
set.seed(23456)
y <- rnorm(n=1000, mean=100)
ks.test(x, y)

        Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.433, p-value < 2.2e-16
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : p-value will be approximate in the presence of ties
```

In this case we get an extremely low p-value and can reject the null hypothesis that the two distributions are the same; indeed they are not, since one sample is from a normal distribution and the other from a Poisson distribution.

The warning messages are due to the implementation of the KS test in R, which assumes a continuous distribution; there should therefore not be any identical values in the two datasets, i.e. ties. Several sources mention that the KS test can deal with both discrete and continuous data (presumably because it works on cumulative distributions), but I'm not sure how the implementation in R handles the discrete case.

Related: http://www.physics.csbsju.edu/stats/KS-test.html

### Testing for normality

On a related topic, how can we check whether a dataset is normally distributed? One can use a normal QQ plot and check whether the values lie on a straight diagonal line.

```r
set.seed(123)
norm_data <- rnorm(n=1000, mean=200, sd=50)
qqnorm(norm_data)
qqline(norm_data)
```

```r
set.seed(123)
gamma_data <- rgamma(n=1000, shape=1)
qqnorm(gamma_data)
qqline(gamma_data)
```

What about some statistical test? Let's try the Shapiro-Wilk test, which tests the null hypothesis that a sample came from a normally distributed population.

```r
#we can't reject the null, since the data really came from a normally distributed population
shapiro.test(norm_data)

        Shapiro-Wilk normality test

data:  norm_data
W = 0.9984, p-value = 0.4765

#we can reject the null, since the values were from a gamma distribution
shapiro.test(gamma_data)

        Shapiro-Wilk normality test

data:  gamma_data
W = 0.8146, p-value < 2.2e-16

#ks.test rejects the null that both distributions are the same
ks.test(norm_data, gamma_data)

        Two-sample Kolmogorov-Smirnov test

data:  norm_data and gamma_data
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
```

See this useful discussion on StackExchange on testing for normality.

# The Poisson distribution

A Poisson distribution is the probability distribution that results from a Poisson experiment. A probability distribution assigns a probability to possible outcomes of a random experiment. A Poisson experiment has the following properties:

- The outcomes of the experiment can be classified as either successes or failures.
- The average number of successes that occurs in a specified region is known.
- The probability that a success will occur is proportional to the size of the region.
- The probability that a success will occur in an extremely small region is virtually zero.
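The probability of observing exactly k successes when the mean number of successes is λ is given by the standard Poisson formula P(X = k) = λ^k e^(−λ) / k!, which can be checked against R's built-in `dpois`:

```r
# Poisson probability mass function computed from the formula
# P(X = k) = lambda^k * exp(-lambda) / k!
pois_pmf <- function(k, lambda) {
  lambda^k * exp(-lambda) / factorial(k)
}

pois_pmf(k = 3, lambda = 2)
dpois(x = 3, lambda = 2)   # should agree with the manual calculation
```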

# Manual linear regression analysis using R

Updated 2014 June 24th

The aim of linear regression is to find the equation of the straight line that fits the data points the best; the best line is one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is:

y = mx + b

where m is the slope or gradient and b is the y-intercept.
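The least-squares slope and intercept can be computed by hand and checked against `lm()`; the data below are made up purely for illustration:

```r
# made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# least-squares estimates:
# slope m = cov(x, y) / var(x), intercept b = mean(y) - m * mean(x)
m <- cov(x, y) / var(x)
b <- mean(y) - m * mean(x)
c(intercept = b, slope = m)

# compare with R's built-in linear model
coef(lm(y ~ x))
```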

# Step by step principal components analysis using R

I've always wondered about the calculations behind a principal components analysis (PCA). An extremely useful tutorial explains the key concepts and runs through the analysis. Here I use R to calculate each step of a PCA in hopes of better understanding the analysis.

```r
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

mean(x)
#[1] 1.81
mean(y)
#[1] 1.91

x_less_mean <- x - mean(x)
x_less_mean
# [1]  0.69 -1.31  0.39  0.09  1.29  0.49  0.19 -0.81 -0.31 -0.71
y_less_mean <- y - mean(y)
y_less_mean
# [1]  0.49 -1.21  0.99  0.29  1.09  0.79 -0.31 -0.81 -0.31 -1.01

cov(x,y)
#[1] 0.6154444
cov(x,x)
#[1] 0.6165556
cov(y,y)
#[1] 0.7165556

covar_matrix <- matrix(c(cov(x,x), cov(x,y), cov(y,x), cov(y,y)),
                       nrow=2, ncol=2, byrow=TRUE,
                       dimnames=list(c("x","y"), c("x","y")))
covar_matrix
#          x         y
#x 0.6165556 0.6154444
#y 0.6154444 0.7165556

eigen(covar_matrix)
#$values
#[1] 1.2840277 0.0490834
#
#$vectors
#          [,1]       [,2]
#[1,] 0.6778734 -0.7351787
#[2,] 0.7351787  0.6778734

e <- eigen(covar_matrix)

#to get the first principal component,
#multiply the first eigenvector with the mean-centred x and y values
for (i in 1:length(x_less_mean)){
  pc1 <- (x_less_mean[i] * e$vectors[1,1]) + (y_less_mean[i] * e$vectors[2,1])
  print(pc1)
}
#[1] 0.8279702
#[1] -1.77758
#[1] 0.9921975
#[1] 0.2742104
#[1] 1.675801
#[1] 0.9129491
#[1] -0.09910944
#[1] -1.144572
#[1] -0.4380461
#[1] -1.223821

for (i in 1:length(x_less_mean)){
  pc2 <- (x_less_mean[i] * e$vectors[1,2]) + (y_less_mean[i] * e$vectors[2,2])
  print(pc2)
}
#[1] -0.1751153
#[1] 0.1428572
#[1] 0.384375
#[1] 0.1304172
#[1] -0.2094985
#[1] 0.1752824
#[1] -0.3498247
#[1] 0.04641726
#[1] 0.01776463
#[1] -0.1626753

z <- array(0, dim=c(10,2))
for (i in 1:length(x_less_mean)){
  pc1 <- (x_less_mean[i] * e$vectors[1,1]) + (y_less_mean[i] * e$vectors[2,1])
  pc2 <- (x_less_mean[i] * e$vectors[1,2]) + (y_less_mean[i] * e$vectors[2,2])
  z[i,1] <- pc1
  z[i,2] <- pc2
}
z
#             [,1]        [,2]
# [1,]  0.82797019 -0.17511531
# [2,] -1.77758033  0.14285723
# [3,]  0.99219749  0.38437499
# [4,]  0.27421042  0.13041721
# [5,]  1.67580142 -0.20949846
# [6,]  0.91294910  0.17528244
# [7,] -0.09910944 -0.34982470
# [8,] -1.14457216  0.04641726
# [9,] -0.43804614  0.01776463
#[10,] -1.22382056 -0.16267529

#now do this using the inbuilt prcomp() function
data <- data.frame(x, y)
data.pca <- prcomp(data)
data.pca
#Standard deviations:
#[1] 1.1331495 0.2215477
#
#Rotation:
#         PC1        PC2
#x -0.6778734  0.7351787
#y -0.7351787 -0.6778734

data.pca$x
#              PC1         PC2
# [1,] -0.82797019  0.17511531
# [2,]  1.77758033 -0.14285723
# [3,] -0.99219749 -0.38437499
# [4,] -0.27421042 -0.13041721
# [5,] -1.67580142  0.20949846
# [6,] -0.91294910 -0.17528244
# [7,]  0.09910944  0.34982470
# [8,]  1.14457216 -0.04641726
# [9,]  0.43804614 -0.01776463
#[10,]  1.22382056  0.16267529

op <- par(mfrow=c(1,2))
plot(z)
plot(data.pca$x)
#reset par()
par(mfrow=c(1,1))
```

I was initially puzzled as to why the signs are inverted between calculating the principal components manually and using prcomp(). The explanation is that eigenvectors are only defined up to sign: if v is an eigenvector, then so is -v, so different implementations can legitimately return either. The relationships between the points are unchanged. Interestingly, in the tutorial listed at the start of this post, the first principal components are identical to the manual calculation while the second principal components are inverted.
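One way to see that the inverted signs are harmless: an eigenvector multiplied by -1 still satisfies the eigenvector equation, so both sets of scores are equally valid. A quick check using the covariance matrix from above:

```r
# eigenvectors are only defined up to sign:
# if A v = lambda v, then A (-v) = lambda (-v) as well
A <- matrix(c(0.6165556, 0.6154444,
              0.6154444, 0.7165556), nrow = 2, byrow = TRUE)
e <- eigen(A)
v <- e$vectors[, 1]
lambda <- e$values[1]

# both v and -v satisfy the eigenvector equation
all.equal(as.vector(A %*% v), lambda * v)
all.equal(as.vector(A %*% -v), lambda * -v)
```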