BED to GRanges

Updated 2015 April 6th to include the intersect_bed() function in the bedr package.

Last year I saw a post on Writing an R package from scratch and I always wanted to follow the tutorial. Yesterday while trying to make some plots using Gviz, I had some BED-like files (not supported by Gviz), which I wanted to convert into a GRanges object (supported by Gviz). I thought it would be easier if there was a function that could load BED-like files into GRanges objects. I couldn't find one so I thought I'll write a very primitive R package that contains this function. This post is on the creation of the BED to GRanges function and the R package that contains it.

Continue reading

BAM to CRAM

I was catching up on some blog reading and just read Ewan's latest post on CRAM going mainline. CRAM files are alignment files like BAM files but provides better compression. The toolkit for manipulating CRAM files is called CRAMTools and is a set of Java tools and APIs for efficient compression of sequence read data. This is necessary because of the constant increase in the throughput of sequencers. I went through the CRAM toolkit, format specification, the CRAM tutorial on the EBI page and decided to test it out.

Continue reading

Getting started with Picard

Updated hyperlinks on the 2015 January 26th; please comment if you find any more dead links.

Picard is a suite of Java-based command-line utilities that manipulate SAM/BAM files. Currently, I'm analysing some paired-end libraries and I wanted to calculate the average insert size based on the alignments; that's how I found Picard. While reading the documentation I realised that Picard has a lot more to offer and hence I'm writing this post to look into specific tools that I thought would be useful.

To get started, download Picard and an alignment (whole genome shotgun sequencing) from the 1000 genomes project that we can use to test Picard:

Continue reading

I've joined Twitter

Today while reading a paper, I found some interesting one-liner facts. They are way too short to create a post on but I would like to make a repository of them. What better place to store these facts than Twitter!

You can follow me on Twitter for a list of facts on molecular biology and on bioinformatics that I didn't know or have forgotten about over the years.

Update 7th April 2013

I've been using Twitter for almost a year now and have found better use of it then what I had originally planned. I've learned of interesting papers, software and entertaining bits and pieces from the people I follow. I've started to play around with the R TwitteR package and started to read a book on Twitter, called The Tao of Twitter. It's quite business orientated but I'm enjoying the read.

The main concept of the book is describing the Tao, which refers to these three concepts:

  1. Targeted connections
  2. Meaningful content
  3. Authentic helpfulness

where by you can get the most out of Twitter if you network with the right people, share information that is useful to others and are generally perceived as being genuine when it comes to helping others.

Setting up Windows for bioinformatics

Updated 2014 May 20th: I created this post on the 19th of August in 2011 when I just purchased yet another computer (Core i7 2600, 8 gigs ram and a GTX460) and was setting it up for doing some bioinformatics. I updated this page again around mid 2013 when I bought another laptop (Asus N56V, Core i7 3630QM with 12 gigs ram) and was setting it up for work again. I'm updating this page again because I'm seeing some increase in traffic to this page (and thankfully not because I had purchased another computer!).

I use Windows on all of my computers. Using just Windows for bioinformatics is not impossible but it's really just easier to have access to a Linux operating system. In the case of my desktop PC, I have a dual boot setup (Ubuntu and Windows 8) and for my laptop, which came pre-installed with Windows 8 making it a pain to setup Ubuntu, I use VirtualBox to have access to Linux.

Each time I set up a new computer, I always install the list of programs below:

Putty: SSH client
Xming X Server: X Window System Server for Windows
WordWeb: handy dictionary program, which looks up any word you highlight
Launchy: a keystroke program, for quick access to programs
7zip: general purpose zip program
R for Windows
RStudio
Opera: still one of my favourite web browsers and email client
Avast: antivirus
ActivePerl
Cygwin: Linux emulator on Windows
Dropbox: cloud file sharing program
VirtualBox: virtualisation software

My Linux distribution of choice is Ubuntu and using VirtualBox you can have Ubuntu installed inside your Windows installation. Below I outline a list of must have Linux bioinformatic tools for those working in the field of genomics and transcriptomics.

Ubuntu

Download Ubuntu (I would recommend Ubuntu 12.04 LTS). After installing VirtualBox and Ubuntu, here's what I installed immediately:

#VirtualBox guest additions:
sudo ./VBoxLinuxAdditions.run

#zlib (for bwa)
sudo apt-get install zlib*

#download bwa: http://sourceforge.net/projects/bio-bwa/files/
tar -xjf bwa-0.7.5a.tar.bz2
cd bwa-0.7.5a/
make

#install ncurses (for samtools)
sudo apt-get install ncurses-dev

#download SAMTools: http://sourceforge.net/projects/samtools/files/
tar -xjf samtools-0.1.19.tar.bz2
cd samtools-0.1.19/
make

#for BEDTools2
sudo apt-get install build-essential g++
git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make clean; make all

#FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/download.html
wget http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.1.tar.bz2
tar -xjf libgtextutils-0.6.1.tar.bz2 
cd libgtextutils-0.6.1/
./configure
make
make check
sudo make install

wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2
tar -xjf fastx_toolkit-0.0.13.2.tar.bz2
cd fastx_toolkit-0.0.13.2/
./configure
make
sudo make install

#git
sudo apt-get install git-core

For sharing folders click on between VirtualBox and Windows, use Devices/Shared Folders. Afterwards, add your user to the vboxsf group:

sudo adduser `whoami` vboxsf

See also my post on installing R on Ubuntu.

Others

#download your favourite genome
#hg19 for me
wget -O hg19.tar.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
rm *random.fa
rm chrUn_gl0002*
rm *hap*.fa
for file in `ls *.fa | sort -k1V`; do echo $file; cat $file >> hg19.fa; done
rm chr*.fa

#download blat
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/blat/blat

Conclusions

Most personal computers available these days are powerful enough to be running a virtualised installation of an operating system. In my humble opinion, if you're setting up Windows for bioinformatics, the easiest thing to do is just to install VirtualBox and Ubuntu, and installing the bioinformatic programs in that Ubuntu instance.

DESeq vs. edgeR vs. baySeq

6th April 2012: For a more updated version of this post, please refer see this post.

A very simple comparison

Using the TagSeqExample.tab file from the DESeq package as the benchmark dataset. According to DESeq authors, T1a and T1b are similar, so I removed the second column in the file corresponding to T1a:

cut --complement -f2 TagSeqExample.tab > tag_seq_example_less_t1a.tsv

Hierarchical clustering

To get an idea of how similar the libraries are, I performed hierarchical clustering using the Spearman correlation of libraries (see here). Note that these raw reads have not been adjusted for sequencing depth, but hopefully by using a ranked correlation method we can bypass this necessity.

From the dendrogram, we observe the marked difference of the normal samples to the tumour samples.

Installing the R packages

I then installed all the required packages (R version 2.12.0, biocinstall version 2.7.4.):

source("http://www.bioconductor.org/biocLite.R")
biocLite("baySeq")
biocLite("DESeq")
biocLite("edgeR")

DESeq

DESeq code adapted from the vignette:

countsTable <- read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, stringsAsFactors=TRUE)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[,-1]
head(countsTable)
conds = c("T","T","T","N","N")
cds<-newCountDataSet(countsTable,conds)
cds<-estimateSizeFactors(cds)
sizeFactors(cds)
T1b T2 T3 N1 N2
0.5587394 1.5823096 1.1270425 1.2869337 0.8746998
cds <- estimateVarianceFunctions (cds)
res <- nbinomTest (cds, "T", "N")
resSig <- res[ res$padj < .1, ]
nrow(resSig)
[1] 1072
write.csv(resSig,file="deseq_res_sig.csv")

edgeR

First transform the tag_seq_example_less_t1a.tsv file to the edgeR format using this script.

tsv_to_edger.pl tag_seq_example_less_t1a.tsv

library(edgeR)
targets = read.delim(file = "targets.txt", stringsAsFactors = FALSE)
d = readDGE(targets,comment.char="#")
d$samples
files group description lib.size norm.factors
1 T1b patient patient 2399311 1
2 T2 patient patient 7202770 1
3 T3 patient patient 5856461 1
4 N1 control control 6376307 1
5 N2 control control 3931277 1

Recently edgeR added a normalisation function into the package called calcNormFactors(). For more information see this paper.

d = calcNormFactors(d)
d$samples
files group description lib.size norm.factors
1 T1b patient patient 2399311 1.1167157
2 T2 patient patient 7202770 0.9978323
3 T3 patient patient 5856461 0.9170402
4 N1 control control 6376307 0.9546010
5 N2 control control 3931277 1.0251550

This approach is different from the one taken by DESeq.

d = estimateCommonDisp(d)
de.com = exactTest(d)
options(digits=4)
topTags(de.com)
detags.com = rownames(topTags(de.com)$table)

sum(de.com$table$p.value < 0.01)
[1] 895

sum(p.adjust(de.com$table$p.value,method="BH") < 0.1)
[1] 593

good = sum(p.adjust(de.com$table$p.value,method="BH") < 0.1)
goodList = topTags(de.com, n=good)
sink("edger.txt")
goodList
sink()

baySeq

library(baySeq)
all = read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, sep="\t")
lib <- c(2399311,7202770,5856461,6376307,3931277)
replicates <- c(1,1,1,2,2)
groups <- list(NDE = c(1,1,1,1,1), DE = c(1,1,1,2,2) )
cname <- all[,1]
all <- all[,-1]
all = as.matrix(all)
CD <- new("countData", data = all, replicates = replicates, libsizes = as.integer(lib), groups = groups)
CD@annotation <- as.data.frame(cname)
cl <- NULL
CDP.NBML <- getPriors.NB(CD, samplesize = 1000, estimation = "QL", cl = cl)
CDPost.NBML <- getLikelihoods.NB(CDP.NBML, pET = 'BIC', cl = cl)
CDPost.NBML@estProps
[1] 0.8897621 0.1102379
topCounts(CDPost.NBML, group=2)
NBML.TPs <- getTPs(CDPost.NBML, group=2, TPs = 1:100)
bayseq_de = topCounts(CDPost.NBML, group=2, number=2000)
write.csv(bayseq_de, file="bayseq_de.csv")

Overlap

R package Number of differentially expressed genes
DESeq 888
edgeR 895
baySeq 1115
Total 18,760

How to make a Venn diagram

DESeq with edgeR 591
DESeq with baySeq 488
edgeR with baySeq 465
DESeq with edgeR with baySeq 338

This is a first draft post and will be continually updated in the future, but I just thought I'll publish it first.