Tissue specificity

Wikipedia has a definition of entropy with respect to information theory. The introduction of that article gives an example using a coin toss; if a coin toss is fair, the entropy rate for a fair coin toss is one bit per toss. However, if the coin is not fair, then the uncertainty, and hence the entropy rate, is lower.

The formula for the Shannon entropy is:

-\sum_{i=1}^n p(x_{i}) log_{b}p(x_{i})

Let's test this out:

Continue reading

Mapping repeats 2

Updated 10th September 2013 to include LAST

I previously looked at mapping repeats with respect to sequencing errors in high throughput sequencing and as one would expect, the accuracy of the mapping decreased when sequencing errors were introduced. I then looked at aligning to unique regions of the genome to get an idea of how different short read alignment perform with reads that should map uniquely. Here I combine the two ideas, to see how different short read alignment programs perform when mapping repeats.

When I wrote my first mapping repeats post, I was made aware of this review article on "Repetitive DNA and next-generation sequencing: computational challenges and solutions" via Twitter (thank you CB). It was also his suggestion that it would be interesting to compare different short read alignment programs with respect to mapping repeats, hence this post.

Continue reading

Double square brackets in R

This deserved its own post because I had some difficulty understanding the double square brackets in R. If we search for "double square brackets in R" we come across this tutorial, which shows us that the double square brackets, i.e. [[]], can be used to directly access columns:

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
#vector of sepal lengths using the column name
iris[['Sepal.Length']]
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1
 [23] 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0
 [45] 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7
 [67] 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3
 [89] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2
[111] 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9
[133] 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
#vector of sepal lengths using the column index
iris[[1]]
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1
 [23] 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0
 [45] 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7
 [67] 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3
 [89] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2
[111] 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9
[133] 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
#the double square brackets in R can also be used
#with the single square brackets
iris[[1]][2]
[1] 4.9

Continue reading

Distance Matrix Computation

Here I demonstrate the distance matrix computations using the R function dist().

Firstly let's prepare a small dataset to work with:

#set seed to make example reproducible
set.seed(123)
test <- data.frame(x=sample(1:10000,7), 
                   y=sample(1:10000,7), 
                   z=sample(1:10000,7))
test
     x    y    z
1 2876 8925 1030
2 7883 5514 8998
3 4089 4566 2461
4 8828 9566  421
5 9401 4532 3278
6  456 6773 9541
7 5278 5723 8891

How does this dataset look in 3 dimensional space?

s3d <- scatterplot3d(test, color=1:7, pch=19, type="p")
s3d.coords <- s3d$xyz.convert(test)
text(s3d.coords$x, s3d.coords$y, labels=row.names(test), cex=1, pos=4)

s3dWe can see that points 4 and 6 are quite far away from each other.

Continue reading

Set notation

I've just started the Mathematical Biostatistics Boot Camp 1 and to help me remember the set notations introduced in the first lecture, I'll include them here:

The sample space, \Omega (upper case omega), is the collection of possible outcomes of an experiment, such as a die roll:

\Omega = \{1, 2, 3, 4, 5, 6\}

An event, say E, is a subset of \Omega, such as the even dice rolls:

E = \{2, 4, 6\}

An elementary or simple event is a particular result of an experiment, such as the roll of 4 (represented as a lowercase omega):

\omega = 4

A null event or the empty set is represented as \emptyset.

Continue reading