Updated 2017 October 14th

A key measure in information theory is entropy, which is **the amount of uncertainty involved in a random process**; the lower the uncertainty, the lower the entropy. For example, a fair coin flip has lower entropy than a fair die roll because a die roll has more equally likely outcomes {1, 2, 3, 4, 5, 6} than a coin flip {H, T}. Entropy is measured in bits; one bit is a single binary value, either 1 or 0. Since a fair coin toss has only two equally likely outcomes, each toss carries one bit of information. However, if the coin is not fair, meaning that it is biased towards either heads or tails, there is less uncertainty, i.e. lower entropy; if a coin lands on heads 60% of the time, we are more certain of heads than with a fair coin (50% heads).
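To put rough numbers on the coin and die comparison, here is a minimal sketch using the standard Shannon entropy formula (defined further down). The helper `entropy_bits` is a hypothetical name introduced only for this illustration; it is not part of the original post.

```r
# Hypothetical helper (not from the original post): Shannon entropy in bits
# of a discrete distribution given as a vector of probabilities.
entropy_bits <- function(p) {
  stopifnot(all(p >= 0), isTRUE(all.equal(sum(p), 1)))
  p <- p[p > 0]          # outcomes with zero probability contribute nothing
  -sum(p * log2(p))
}

fair_coin   <- entropy_bits(c(0.5, 0.5))
fair_die    <- entropy_bits(rep(1/6, 6))
biased_coin <- entropy_bits(c(0.6, 0.4))

fair_coin    # 1 bit
fair_die     # log2(6), about 2.585 bits
biased_coin  # about 0.971 bits, lower than the fair coin
```

The ordering is the point here: biased coin < fair coin < fair die.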

There’s a brilliant example on Wikipedia explaining the relationship between entropy (uncertainty) and information content.

For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin, “Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning.” Assuming one not residing near the Earth’s poles or polar circles, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

There is no uncertainty in the above forecast, so it conveys 0 bits of information.
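As a small worked step, the information carried by a single outcome with probability p is commonly defined as -log2(p) bits (a standard definition that the post does not spell out). Assuming the probability that the night is dark is exactly 1:

$$I(\text{dark}) = -\log_{2} p(\text{dark}) = -\log_{2} 1 = 0 \text{ bits}$$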

Mathematically, the Shannon entropy is defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_{i}) \log_{b} p(x_{i})$$
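One special case is worth working through, since it comes up again in the conclusions: when all n outcomes are equally likely, each probability is 1/n and the sum collapses to a single logarithm.

$$p(x_{i}) = \frac{1}{n} \;\Rightarrow\; -\sum_{i=1}^{n} \frac{1}{n} \log_{b} \frac{1}{n} = \log_{b} n$$

With base b = 2 the result is in bits: a fair coin (n = 2) gives exactly 1 bit and a fair die (n = 6) gives log2(6), roughly 2.585 bits.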

Let’s test this out using the coin flip example above:

    # 100 fair coin tosses
    set.seed(123)
    fair <- rbinom(100,1,0.5)
    fair
     [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
    [49] 0 1 0 0 1 0 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 0 0
    [97] 1 0 0 1

    # almost 50-50
    table(fair)
    fair
     0  1 
    53 47 

    # calculate Shannon entropy of fair coin
    # close to 1 bit of information
    -( (0.53 * log2(0.53)) + (0.47 * log2(0.47)) )
    [1] 0.9974016

    # now for an unfair coin
    set.seed(123)
    unfair <- rbinom(100,1,0.2)
    unfair
     [1] 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    [49] 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
    [97] 0 0 0 0

    table(unfair)
    unfair
     0  1 
    82 18 

    # calculate Shannon entropy of unfair coin
    # less entropy because of less uncertainty
    -( (0.82 * log2(0.82)) + (0.18 * log2(0.18)) )
    [1] 0.680077
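The proportions in the calculations above were typed in by hand from the `table()` output. As a possible alternative, the sketch below computes the entropy directly from a vector of observed outcomes. `entropy_from_outcomes` is a hypothetical helper name, and the two input vectors are built to have the same 53/47 and 82/18 splits rather than being re-simulated.

```r
# Hypothetical helper (not in the original post): estimate entropy in bits
# from observed outcomes, using their relative frequencies as probabilities.
entropy_from_outcomes <- function(x) {
  counts <- as.numeric(table(x))
  p <- counts[counts > 0] / sum(counts)  # drop empty categories, then normalise
  -sum(p * log2(p))
}

# Same splits as the two simulations above: 53/47 and 82/18
fair_like   <- c(rep(0, 53), rep(1, 47))
unfair_like <- c(rep(0, 82), rep(1, 18))

h_fair   <- entropy_from_outcomes(fair_like)
h_unfair <- entropy_from_outcomes(unfair_like)

h_fair    # about 0.9974, matching the hand calculation
h_unfair  # about 0.6801
```

Working from counts avoids transcription errors and also handles more than two outcome categories.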

What does all this have to do with measuring tissue specificity? I came across this paper, “Promoter features related to tissue specificity as measured by Shannon entropy”, and it spurred me to learn about entropy. Basically, if a gene is expressed in a tissue-specific manner, its expression is concentrated in a few tissues, so we are more certain of where it is expressed and the entropy is lower. I’ll go through various expression patterns below.

    # first let's define a function to calculate Shannon entropy
    # note that this is different from the coin toss example, which used the frequencies
    # this example uses expression strength and
    # each value is normalised by the sum of the total expression
    # code from https://stat.ethz.ch/pipermail/r-help/2008-July/167112.html
    shannon.entropy <- function(p){
      if (min(p) < 0 || sum(p) <= 0)
        return(NA)
      p.norm <- p[p>0]/sum(p)
      -sum(log2(p.norm)*p.norm)
    }

    # a gene that is evenly expressed across 30 samples
    set.seed(123)
    fairly_even_expression <- rnorm(30,50,15)
    fairly_even_expression
     [1] 41.59287 46.54734 73.38062 51.05763 51.93932 75.72597 56.91374 31.02408 39.69721 43.31507
    [11] 68.36123 55.39721 56.01157 51.66024 41.66238 76.80370 57.46776 20.50074 60.52034 42.90813
    [21] 33.98264 46.73038 34.60993 39.06663 40.62441 24.69960 62.56681 52.30060 32.92795 68.80722
    shannon.entropy(fairly_even_expression)
    [1] 4.84333

    # a gene that is highly expressed in one sample
    # with some background expression in other samples
    set.seed(123)
    high_expression <- rnorm(29,10,2)
    high_expression[30] <- 100
    high_expression
     [1]   8.879049   9.539645  13.117417  10.141017  10.258575  13.430130  10.921832   7.469878
     [9]   8.626294   9.108676  12.448164  10.719628  10.801543  10.221365   8.888318  13.573826
    [17]  10.995701   6.066766  11.402712   9.054417   7.864353   9.564050   7.947991   8.542218
    [25]   8.749921   6.626613  11.675574  10.306746   7.723726 100.000000
    # note the lower entropy compared with fairly_even_expression
    shannon.entropy(high_expression)
    [1] 4.401731

    # higher expression in half of the samples
    high_expression_half <- c(rep(1,15),rep(30,15))
    high_expression_half
     [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
    shannon.entropy(high_expression_half)
    [1] 4.112483

    # very specific expression
    specific_expression <- rep(0,29)
    specific_expression[30] <- 100
    specific_expression
     [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    [25]   0   0   0   0   0 100
    shannon.entropy(specific_expression)
    [1] 0

    # very specific expression in 3 out of 30 samples
    three_specific <- c(rep(1,27),25,65,100)
    three_specific
     [1]   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
    [25]   1   1   1  25  65 100
    shannon.entropy(three_specific)
    [1] 2.360925

    # equal expression
    # note that this will be the same regardless of the expression strength
    equal_expression <- rep(5,30)
    equal_expression
     [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
    shannon.entropy(equal_expression)
    [1] 4.906891
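In practice there is usually more than one gene, so the same function can be applied to every row of an expression matrix. The sketch below assumes a layout of genes in rows and tissues in columns; the matrix `expr`, its gene names and its tissue names are all made up for illustration.

```r
# Function as defined in the post (taken from the r-help thread it cites)
shannon.entropy <- function(p){
  if (min(p) < 0 || sum(p) <= 0)
    return(NA)
  p.norm <- p[p>0]/sum(p)
  -sum(log2(p.norm)*p.norm)
}

# Hypothetical toy matrix: 3 genes (rows) by 4 tissues (columns)
expr <- rbind(
  housekeeping = c(10, 10, 10, 10),
  two_tissues  = c(50, 50,  0,  0),
  one_tissue   = c( 0,  0,  0, 80)
)
colnames(expr) <- c("brain", "liver", "heart", "lung")

# One entropy value per gene; lower means more tissue specific
gene_entropy <- apply(expr, 1, shannon.entropy)
gene_entropy

# Rank genes from most to least tissue specific
sort(gene_entropy)
```

With four tissues the values range from 0 bits (expressed in a single tissue) to log2(4) = 2 bits (evenly expressed).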

Plot the expression patterns for the 6 scenarios.

    par(mfrow=c(2,3))
    barplot(equal_expression, main=shannon.entropy(equal_expression))
    barplot(fairly_even_expression, main=shannon.entropy(fairly_even_expression))
    barplot(high_expression, main=shannon.entropy(high_expression))
    barplot(high_expression_half, main=shannon.entropy(high_expression_half))
    barplot(three_specific, main=shannon.entropy(three_specific))
    barplot(specific_expression, main=shannon.entropy(specific_expression))

*I should have labelled the axes*.
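One way the axes could be labelled is sketched below. `plot_expression` is a hypothetical wrapper, the axis labels "Sample" and "Expression" are assumptions about what the bars represent, and only two of the six scenarios are redrawn to keep the example short.

```r
# Hypothetical wrapper (not in the original post): bar plot with labelled
# axes and the Shannon entropy, rounded to 2 decimal places, in the title.
plot_expression <- function(x, label) {
  p <- x[x > 0] / sum(x)       # same normalisation as shannon.entropy
  h <- -sum(p * log2(p))
  barplot(x,
          names.arg = seq_along(x),
          xlab = "Sample",
          ylab = "Expression",
          main = sprintf("%s (%.2f bits)", label, h))
  invisible(h)                 # return the entropy without printing it
}

par(mfrow = c(1, 2))
h_equal    <- plot_expression(rep(5, 30), "Equal")
h_specific <- plot_expression(c(rep(0, 29), 100), "Specific")
```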

### Conclusions

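Equal expression amongst the 30 libraries resulted in a Shannon entropy of ~4.91 bits, which is log2(30), the maximum possible for 30 samples; this is analogous to a fair coin toss, which reaches the maximum of 1 bit for two outcomes. It is close to 5 bits because 5 bits are needed to distinguish between 30 samples (2^4 = 16 is too few; 2^5 = 32 is enough). The more specifically a gene is expressed, the less uncertainty there is, and therefore the lower the entropy.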
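The arithmetic behind the "close to 5 bits" remark can be checked in a couple of lines. This is only a sketch; `n_samples`, `max_entropy` and `bits_needed` are names chosen for this example.

```r
n_samples <- 30

# Entropy of perfectly even expression over n samples is log2(n)
max_entropy <- log2(n_samples)
max_entropy                  # 4.906891, the value reported for equal_expression

# Whole bits needed to give each of the 30 samples a distinct binary label
bits_needed <- ceiling(max_entropy)
bits_needed                  # 5

# 4 bits cover only 16 labels, 5 bits cover 32
c(2^4, 2^5)
```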

This work is licensed under a Creative Commons Attribution 4.0 International License.

Actually, Dave, maybe you can refer to John Rinn’s lincRNA paper.

http://genesdev.cshlp.org/content/early/2011/09/02/gad.17446611.abstract

Hi Andy,

Thanks for the paper; I’ve read some of his other work but not this one yet. I’ll definitely have a read.

Cheers,

Dave

Cheers 🙂

Thanks for the post, it’s really interesting! Just a heads-up, the equation for Shannon entropy is not rendering, but instead still visible as raw Latex. Also, the link to the Schug et al. paper is just to the journal’s homepage. Here is the one to the journal article:

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2005-6-4-r33

Thanks,

Josh