Finding genes with co-expression patterns

Can the R bioconductor package "WGCNA" find artefactually created modules?

Firstly some (subpar) code to generate an artefactual list of genes with co-expression patterns (modules):

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage $0 <number of samples> <number of rows> <number of modules>\n";
my $number_of_sample = shift or die $usage;
my $number_of_row = shift or die $usage;
my $number_of_module = shift or die $usage;

my @master_pattern = ();

#print header
my $header = "id\t";
foreach my $sample (1 .. $number_of_sample){
   my $sample_name = 'sample' . $sample;
   $header .= "$sample_name\t";
}
$header =~ s/\t$//;
print "$header\n";
#end printing header

my $chunk = sprintf("%.0f",$number_of_row / $number_of_module);
for (my $i = 1; $i <= $number_of_row; $i += $chunk){
   #determine module pattern
   my @pattern = ();
   for (1 .. $number_of_sample){
      #randomly generate 0 or 1
      my $rand = int(rand(2));
      push(@pattern,$rand);
   }
   my $pattern = "@pattern";
   push(@master_pattern,$pattern);
   #go in chunks or modules
   for (my $j = $i; $j < $i + $chunk; ++$j){
      my $p = join('',@pattern);
      my $expression = "${j}_$p\t";
      foreach my $pattern (@pattern){
         my $random_expression = random_expression($pattern);
         $expression .= "$random_expression\t";
      }
      $expression =~ s/\t$//;
      print "$expression\n";
   }
}

warn("Patterns:\n");
foreach my $pattern (@master_pattern){
   warn("$pattern\n");
}

sub random_expression {
   my ($which) = @_;
   my $rand = '';
   if ($which == '0'){
      $rand = '1000';
      while ($rand >= 500){
         $rand = int(rand(1000));
      }
   } else {
      $rand = '0';
      while ($rand <= 500){
         $rand = int(rand(1000));
      }
   }
  return $rand;
}

Running the code:

./generate_random_module.pl 10 1000 20 > 10_sample_1000_list_20_module.tsv

Patterns:
1 0 0 0 0 1 0 0 0 1
0 0 0 0 0 0 1 0 1 1
0 0 1 0 0 0 1 0 0 1
1 0 0 0 0 1 1 1 0 1
0 1 0 0 1 0 0 1 1 0
0 0 0 0 1 1 1 1 0 0
1 1 1 0 0 0 0 0 1 0
1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 0 1
0 0 1 1 1 0 1 1 1 0
0 1 1 0 0 0 0 0 1 1
0 1 0 0 1 0 0 1 1 0
0 0 1 1 0 0 1 0 1 0
0 0 0 0 1 1 1 1 0 1
0 1 0 0 1 1 0 0 0 1
1 0 0 1 1 1 0 1 1 0
0 1 1 1 1 1 1 0 1 0
0 1 0 1 0 0 1 1 1 0
0 1 1 0 1 0 1 0 0 0
0 1 1 1 0 1 1 0 1 1

So there are 20 artefactually created co-expression patterns (1 for over expression and 0 for under expression). Some patterns look very similar.

library("WGCNA")

options(stringsAsFactors=FALSE)

data = read.table("10_sample_1000_list_20_module.tsv",sep="\t",header=TRUE)

test = as.data.frame(t(data[,-c(1)]))

names(test) = rownames(data)

rownames(test) = names(data)[-c(1)]

#gsg = goodSamplesGenes(test,verbose=3)

#gsg$allOK

sampleTree = flashClust(dist(test),method="average")

png (filename="clustering.png", width=2000, height=2000, pointsize=12)
par(cex=0.6)
par(mar=c(0,4,2,0))
plot(sampleTree, main="Sample clustering", sub="", xlab="", cex.lab=1.5,cex.axis=1.5,cex.main=2)
abline(h=40000,col="red")
dev.off ()  ##  Close the image

#powers = c(c(1:10), seq(from=12,to=20,by=2))
#
#sft = pickSoftThreshold(test,powerVector=powers, verbose=5)
#
net = blockwiseModules(test,power=6,minModuleSize=30,reassignThreshold=0, mergeCutHeight=0.25,numericLabels=TRUE,pamRespectsDendro=FALSE,saveTOMs=TRUE,saveTOMFileBase="testTOM",verbose=3)
#
table(net$colors)
#
#sizeGrWindow(12,9)
#
mergedColors = labels2colors(net$colors)
#
png (filename="plot_dendro_and_colors.png", width=2000, height=2000, pointsize=12)
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]], "Module colors", dendroLabels= FALSE, hang=0.03, addGuide = TRUE, guideHang = 0.05)
dev.off()
#
moduleLabels = net$colors
#
moduleColors = labels2colors(net$colors)
#
MEs = net$MEs

geneTree = net$dendrograms[[1]]

save(MEs, moduleLabels, moduleColors, geneTree, net, file = "number_2.RData")

A total of 15 modules are detected:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
108 106 96 94 94 61 56 55 54 54 53 50 46 43 30

Some of the patterns were very similar and were probably merged together, hence 15 out of the 20 artificial modules were detected. Also note that over expression was generated by randomly picking a number between 500 and 1000 and under expression was from 1 - 500. Looks pretty good!

RefSeq promoters

Is there any nucleotide bias with the -40 region of RefSeqs?

Taking all hg19 RefSeqs that mapped to assembled chromosomes (36,004) and extracting the nucleotide sequences 40 bp upstream of the RefSeq gene model, I generated a sequence logo.

No obvious TATA box enrichment, which was expected since only 10-20% of genes in eukaryotes have a TATA box (perhaps at -13 to -16?). Note the enrichment of a cytosine at -1.

Then I took the -20 and +20 sequences and generated the same sequence logo plot.

Note the enrichment of purines (adenine and guanine) at the 5' UTR start (position 21).

Making a heatmap with R

Update 28th March 2017: I recommend using the superheat package for creating heatmaps.

Heatmaps are great for visualising large tables of data; they are definitely popular in many transcriptome papers. Here are the basic commands for making your own heatmap:

data <- read.table("test.txt",sep="\t",header=TRUE,row.names=1)
data_matrix <- data.matrix(data)
heatmap(data_matrix)
#For more information see help(heatmap)
#If you want cooler colours than the default
install.packages("RColorBrewer")
library("RColorBrewer")
#display all colour schemes
display.brewer.all()
heatmap(data_matrix,col=brewer.pal(9,"Blues"))
#if you want to preserve the column order
#since the order may be informative
heatmap(data_matrix,Colv=NA,col=brewer.pal(9,"Blues"))

display_brewer_allThe colour schemes from the function display.brewer.all()

Returning the values used for the heatmap

Taken from the email thread: [BioC] heatmap.2() get matrix after hierarchical clustering


## Some input sample matrix
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))

y
             t1          t2          t3           t4         t5
g1   0.09010160  0.18099063  0.06491397  0.976286072 -0.1365132
g2   1.46478627  0.07811639  0.92728521 -0.006493938 -0.4911211
g3   0.31297844 -1.56484613 -0.85887941  0.353068737 -1.0221706
g4   0.09407695  0.43792430  0.46242189  1.491145523 -0.7022191
g5   0.69112967  0.36843583  1.19093386 -0.032736910  1.8298866
g6  -1.02819430 -0.43686436  0.49728848 -0.120586171 -2.9141145
g7   0.47593269 -1.87135877  1.02256261 -1.036563429  0.8059392
g8  -0.10274882 -0.50991134  1.38989532  0.416684239 -0.4231279
g9  -1.17358163 -0.93794222 -0.83479604 -0.872205800  0.1767881
g10  0.23001732  0.10358520  0.69702767 -1.797501815 -0.4973721

## Run heatmap.2 on this matrix
install.packages("gplots")
library(gplots)
test <- heatmap.2(y)
y[rev(test$rowInd), test$colInd]

            t5          t2           t4          t1          t3
g5   1.8298866  0.36843583 -0.032736910  0.69112967  1.19093386
g7   0.8059392 -1.87135877 -1.036563429  0.47593269  1.02256261
g10 -0.4973721  0.10358520 -1.797501815  0.23001732  0.69702767
g4  -0.7022191  0.43792430  1.491145523  0.09407695  0.46242189
g1  -0.1365132  0.18099063  0.976286072  0.09010160  0.06491397
g8  -0.4231279 -0.50991134  0.416684239 -0.10274882  1.38989532
g2  -0.4911211  0.07811639 -0.006493938  1.46478627  0.92728521
g6  -2.9141145 -0.43686436 -0.120586171 -1.02819430  0.49728848
g3  -1.0221706 -1.56484613  0.353068737  0.31297844 -0.85887941
g9   0.1767881 -0.93794222 -0.872205800 -1.17358163 -0.83479604

## Row clustering (adjust here distance/linkage methods to what you need!)
hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete")

## Column clustering (adjust here distance/linkage methods to what you need!)
hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete")

## Plot heatmap
heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), scale="row", density.info="none", trace="none")

## Return matrix with row/column sorting as in heatmap
y[rev(hr$labels[hr$order]), hc$labels[hc$order]]

             t2           t4          t1          t3         t5
g7  -1.87135877 -1.036563429  0.47593269  1.02256261  0.8059392
g5   0.36843583 -0.032736910  0.69112967  1.19093386  1.8298866
g9  -0.93794222 -0.872205800 -1.17358163 -0.83479604  0.1767881
g10  0.10358520 -1.797501815  0.23001732  0.69702767 -0.4973721
g2   0.07811639 -0.006493938  1.46478627  0.92728521 -0.4911211
g4   0.43792430  1.491145523  0.09407695  0.46242189 -0.7022191
g1   0.18099063  0.976286072  0.09010160  0.06491397 -0.1365132
g3  -1.56484613  0.353068737  0.31297844 -0.85887941 -1.0221706
g8  -0.50991134  0.416684239 -0.10274882  1.38989532 -0.4231279
g6  -0.43686436 -0.120586171 -1.02819430  0.49728848 -2.9141145

test <- heatmap.2(y)

heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), scale="row", density.info="none", trace="none")

An example of creating a heatmap using transcriptome data

I will use the TagSeqExample.tab file from DESeq.

#install DESeq if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq")
#load DESeq
library("DESeq")
example_file <- system.file ("extra/TagSeqExample.tab", package="DESeq")
data <- read.delim(example_file, header=T, row.names="gene")
head(data)
           T1a T1b  T2  T3  N1  N2
Gene_00001   0   0   2   0   0   1
Gene_00002  20   8  12   5  19  26
Gene_00003   3   0   2   0   0   0
Gene_00004  75  84 241 149 271 257
Gene_00005  10  16   4   0   4  10
Gene_00006 129 126 451 223 243 149
nrow(data)
[1] 18760
#check distribution of row sums
quantile(rowSums(data))
    0%    25%    50%    75%   100% 
     0     31    228    883 234506
#create a workable set
data_subset <- data[rowSums(data)>50000,]
nrow(data_subset)
[1] 49
data_matrix <- data.matrix(data_subset)
heatmap(data_matrix)

Now using heatmap.2():

#install if necessary
install.packages("gplots")
library(gplots)
heatmap.2(data_matrix)

heatmap.2(data_matrix,scale="row")

#using a red and blue colour scheme without traces and scaling each row
library("RColorBrewer")
heatmap.2(data_matrix,col=brewer.pal(11,"RdBu"),scale="row", trace="none")

deseq_heatmap2_rdbu

#picking your own colours
#black and red
colfunc <- colorRampPalette(c("black", "red"))
heatmap.2(data_matrix,col=colfunc(15),scale="row", trace="none")

heatmap_2_black_red

#picking your own colours
#black, white and red
#to see all colours type colours()
colfunc <- colorRampPalette(c("black", "white", "red"))
heatmap.2(data_matrix,col=colfunc(15),scale="row", trace="none")

heatmap_2_black_white_redPerhaps you can pick better colours than me!.

Manually checking some of the values in heatmap.2()

#The first row in the heatmap
data_matrix["Gene_08743",]
  T1a   T1b    T2    T3    N1    N2 
10404  4626 32103 14288 19626 16079 
summary(data_matrix["Gene_08743",])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   4626   11380   15180   16190   18740   32100
test <- heatmap.2(data_matrix,scale="row")
#create function to calculate z-score
z_score <- function(x){
+    (x-mean(x))/sd(x)
+ }
z_score(data_matrix["Gene_08743",])
        T1a         T1b          T2          T3          N1          N2 
-0.61945966 -1.23831240  1.70461190 -0.20346381  0.36826272 -0.01163874
#compare the manually calculating values with the ones used for the heatmap
#they are the same, except the ordering
test$carpet[,"Gene_08743"]
         N2         T1b         T1a          T2          T3          N1 
-0.01163874 -1.23831240 -0.61945966  1.70461190 -0.20346381  0.36826272

Forking in Perl

Foreach item in the array, start a fork. $pid returns 0 if it is the child process. This way you can spawn 4 child processes from one parent. This page explains it all and the code shown below is an adaptation of code shown from the link. I just added code to show the current process id and the process id of the child.

#!/usr/bin/perl

use strict;
use warnings;

my @seq = qw/one two three four/;

my @childs = ();

warn "Starting on \n: created new fork on $pid\n";
   if ($pid) {
      # parent
      warn "\tPushing $pid into array\n";
      push(@childs, $pid);
   } elsif ($pid == 0) {
      # child
      warn "\tRunning child process for $seq\n";
      exit(0);
   } else {
      die "Couldn't fork: $!\n";
   }
}

foreach my $pid (@childs) {
   waitpid($pid, 0);
}

print "Done\n";