Tissue specificity

Updated 2017 October 14th

A key measure in information theory is entropy, which is the amount of uncertainty involved in a random process; the lower the uncertainty, the lower the entropy. For example, there is lower entropy in a fair coin flip than in a fair die roll, since a die roll has more possible outcomes {1, 2, 3, 4, 5, 6} than a coin flip {H, T}. Entropy is measured in bits; a bit is a binary digit that takes a value of either 0 or 1. Since a fair coin toss has two equally likely outcomes, each toss carries one bit of information. However, if the coin is not fair, meaning that it is biased towards either heads or tails, there is less uncertainty, i.e. lower entropy; if a coin lands on heads 60% of the time, we are more certain of heads than with a fair coin (50% heads).

There’s a brilliant example on Wikipedia explaining the relationship between entropy (uncertainty) and information content.

For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin, “Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning.” Assuming one not residing near the Earth’s poles or polar circles, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

There is no uncertainty in the above statement, hence that piece of information carries 0 bits.

Mathematically, the Shannon entropy is defined as:

$$-\sum_{i=1}^{n} p(x_{i}) \log_{b} p(x_{i})$$
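
For example, using base 2 (so that the entropy is in bits), a fair coin gives

$$-(0.5 \log_{2} 0.5 + 0.5 \log_{2} 0.5) = 1 \textrm{ bit}$$

the 60/40 biased coin gives

$$-(0.6 \log_{2} 0.6 + 0.4 \log_{2} 0.4) \approx 0.971 \textrm{ bits}$$

and a fair die gives

$$-\sum_{i=1}^{6} \tfrac{1}{6} \log_{2} \tfrac{1}{6} = \log_{2} 6 \approx 2.585 \textrm{ bits}$$

so the biased coin has less entropy than the fair coin, and both have less entropy than the die.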

Let’s test this out using the coin flip example above:


# 100 fair coin tosses
set.seed(123)
fair <- rbinom(100,1,0.5)
fair
  [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 [49] 0 1 0 0 1 0 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 0 0
 [97] 1 0 0 1

# almost 50-50
table(fair)
fair
 0  1 
53 47

# calculate Shannon entropy of fair coin
# close to 1 bit of information
-( (0.53 * log2(0.53)) + (0.47 * log2(0.47)) )
[1] 0.9974016

# now for an unfair coin
set.seed(123)
unfair <- rbinom(100,1,0.2)
unfair
  [1] 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [49] 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
 [97] 0 0 0 0

table(unfair)
unfair
 0  1 
82 18

# calculate Shannon entropy of unfair coin
# less entropy because of less uncertainty
-( (0.82 * log2(0.82)) + (0.18 * log2(0.18)) )
[1] 0.680077

What does all this have to do with measuring tissue specificity? I came across the paper “Promoter features related to tissue specificity as measured by Shannon entropy”, which spurred me to learn about entropy. Basically, if a gene is expressed in a tissue-specific manner, we are more certain of its expression and hence there is lower entropy. I’ll go through various expression patterns below.

# first let's define a function to calculate Shannon entropy
# note that this is different from the coin toss example, which used the frequencies
# this example uses expression values and
# each value is normalised by the total expression across all samples
# code from https://stat.ethz.ch/pipermail/r-help/2008-July/167112.html
shannon.entropy <- function(p){
   if (min(p) < 0 || sum(p) <= 0) return(NA)
   p.norm <- p[p>0]/sum(p)
   -sum(log2(p.norm)*p.norm)
}
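
# quick sanity check: passing the coin toss counts from above through this
# function reproduces the entropies calculated by hand, since the counts
# are normalised to frequencies
shannon.entropy(c(53, 47))
[1] 0.9974016
shannon.entropy(c(82, 18))
[1] 0.680077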

# a gene that is evenly expressed across 30 samples
set.seed(123)
fairly_even_expression <- rnorm(30,50,15)
fairly_even_expression
 [1] 41.59287 46.54734 73.38062 51.05763 51.93932 75.72597 56.91374 31.02408 39.69721 43.31507
[11] 68.36123 55.39721 56.01157 51.66024 41.66238 76.80370 57.46776 20.50074 60.52034 42.90813
[21] 33.98264 46.73038 34.60993 39.06663 40.62441 24.69960 62.56681 52.30060 32.92795 68.80722
shannon.entropy(fairly_even_expression)
[1] 4.84333

# a gene that is highly expressed in one sample
# with some background expression in other samples
set.seed(123)
high_expression <- rnorm(29,10,2)
high_expression[30] <- 100
high_expression
 [1]   8.879049   9.539645  13.117417  10.141017  10.258575  13.430130  10.921832   7.469878
 [9]   8.626294   9.108676  12.448164  10.719628  10.801543  10.221365   8.888318  13.573826
[17]  10.995701   6.066766  11.402712   9.054417   7.864353   9.564050   7.947991   8.542218
[25]   8.749921   6.626613  11.675574  10.306746   7.723726 100.000000

# note the lower entropy compared with fairly_even_expression
shannon.entropy(high_expression)
[1] 4.401731

# higher expression in half of the samples
high_expression_half <- c(rep(1,15),rep(30,15))
high_expression_half
 [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
shannon.entropy(high_expression_half)
[1] 4.112483

# very specific expression
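# (all the expression is in one sample, so after normalisation that sample
#  has p = 1 and the entropy is -1 * log2(1) = 0)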
specific_expression <- rep(0,29)
specific_expression[30] <- 100
specific_expression
 [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[25]   0   0   0   0   0 100
shannon.entropy(specific_expression)
[1] 0

# very specific expression in 3 out of 30 samples
three_specific <- c(rep(1,27),25,65,100)
three_specific
 [1]   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
[25]   1   1   1  25  65 100
shannon.entropy(three_specific)
[1] 2.360925

# equal expression
# note that this will be the same regardless of the expression strength
equal_expression <- rep(5,30)
equal_expression
 [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
shannon.entropy(equal_expression)
[1] 4.906891

Plot the expression patterns for the 6 scenarios.

par(mfrow=c(2,3))
barplot(equal_expression, main=shannon.entropy(equal_expression))
barplot(fairly_even_expression, main=shannon.entropy(fairly_even_expression))
barplot(high_expression, main=shannon.entropy(high_expression))
barplot(high_expression_half, main=shannon.entropy(high_expression_half))
barplot(three_specific, main=shannon.entropy(three_specific))
barplot(specific_expression, main=shannon.entropy(specific_expression))

[Plot: the six expression patterns, with each panel titled by its Shannon entropy.] I should have labelled the axes.
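
For reference, axis labels can be added inside the barplot calls; a minimal sketch for one panel, using placeholder labels for the axes:

barplot(equal_expression,
        main = shannon.entropy(equal_expression),
        xlab = "Sample",
        ylab = "Expression level")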

Conclusions

Equal expression amongst the 30 libraries resulted in a Shannon entropy of ~4.91 bits, the maximum possible for 30 samples; this is analogous to a fair coin toss, where every outcome is equally likely. It is close to 5 bits because 5 bits are needed to encode 30 distinct samples (2^5 = 32). The more specifically a gene is expressed, the less uncertainty there is, and therefore the lower the entropy.
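
As a quick check, the maximum possible entropy for 30 equally likely outcomes is log2(30), which matches the equal expression result above:

# maximum entropy across 30 samples
log2(30)
[1] 4.906891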




This work is licensed under a Creative Commons Attribution 4.0 International License.
