Updated 2015 April 6th to include the intersect_bed() function in the bedr package.
Last year I saw a post on Writing an R package from scratch and I always wanted to follow the tutorial. Yesterday, while trying to make some plots using Gviz, I had some BED-like files (not supported by Gviz) that I wanted to convert into a GRanges object (supported by Gviz). I thought it would be easier if there was a function that could load BED-like files into GRanges objects. I couldn't find one, so I thought I'd write a very primitive R package that contains this function. This post is on the creation of the BED to GRanges function and the R package that contains it.
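The core of such a function is just parsing and a coordinate shift. Here is an illustrative sketch of the parsing step (the function name `read_bed` and the three-column handling are my own, not the package's actual code); its output maps directly onto a `GRanges(seqnames, IRanges(start, end))` call from the Bioconductor GenomicRanges package:

```r
# Illustrative parser for the first three BED columns (my own sketch,
# not the package's code). BED is zero-based and half-open, while
# GRanges is one-based, so the start coordinate is shifted by one.
read_bed <- function(file) {
  bed <- read.table(file, header = FALSE, sep = "\t",
                    stringsAsFactors = FALSE)
  data.frame(seqnames = bed[[1]],
             start    = bed[[2]] + 1,
             end      = bed[[3]],
             stringsAsFactors = FALSE)
}

# toy usage with a made-up BED file
bed_file <- tempfile(fileext = ".bed")
writeLines(c("chr1\t0\t100", "chr2\t50\t150"), bed_file)
read_bed(bed_file)
```

The data.frame columns then feed straight into the GRanges constructor.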
I was catching up on some blog reading and just read Ewan's latest post on CRAM going mainline. CRAM files are alignment files like BAM files but provide better compression. The toolkit for manipulating CRAM files is called CRAMTools: a set of Java tools and APIs for efficient compression of sequence read data. This is necessary because of the constant increase in the throughput of sequencers. I went through the CRAM toolkit, the format specification and the CRAM tutorial on the EBI page, and decided to test it out.
Updated hyperlinks on 2015 January 26th; please comment if you find any more dead links.
Picard is a suite of Java-based command-line utilities that manipulate SAM/BAM files. Currently, I'm analysing some paired-end libraries and I wanted to calculate the average insert size based on the alignments; that's how I found Picard. While reading the documentation I realised that Picard has a lot more to offer and hence I'm writing this post to look into specific tools that I thought would be useful.
To get started, download Picard and an alignment (whole genome shotgun sequencing) from the 1000 genomes project that we can use to test Picard:
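For the insert-size calculation that led me to Picard, the relevant tool is CollectInsertSizeMetrics. A sketch of the invocation (the BAM filename is a placeholder, and older Picard releases ship each tool as its own jar, e.g. CollectInsertSizeMetrics.jar, rather than inside picard.jar):

```shell
# Sketch: compute insert size statistics on a coordinate-sorted,
# paired-end BAM (filenames here are placeholders)
java -jar picard.jar CollectInsertSizeMetrics \
    I=example.bam \
    O=insert_size_metrics.txt \
    H=insert_size_histogram.pdf
```

The MEAN_INSERT_SIZE and STANDARD_DEVIATION columns of the metrics file report the average insert size and its spread.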
Today while reading a paper, I found some interesting one-liner facts. They are way too short to create a post on but I would like to make a repository of them. What better place to store these facts than Twitter!
You can follow me on Twitter for a list of facts on molecular biology and on bioinformatics that I didn't know or have forgotten about over the years.
Update 7th April 2013
I've been using Twitter for almost a year now and have found better use of it than what I had originally planned. I've learned of interesting papers, software and entertaining bits and pieces from the people I follow. I've started to play around with the R TwitteR package and started to read a book on Twitter, called The Tao of Twitter. It's quite business orientated but I'm enjoying the read.
The main concept of the book is describing the Tao, which refers to these three concepts:
- Targeted connections
- Meaningful content
- Authentic helpfulness
whereby you can get the most out of Twitter if you network with the right people, share information that is useful to others and are generally perceived as being genuine when it comes to helping others.
Updated 2014 May 20th: I created this post on the 19th of August in 2011 when I just purchased yet another computer (Core i7 2600, 8 gigs ram and a GTX460) and was setting it up for doing some bioinformatics. I updated this page again around mid 2013 when I bought another laptop (Asus N56V, Core i7 3630QM with 12 gigs ram) and was setting it up for work again. I'm updating this page again because I'm seeing some increase in traffic to this page (and thankfully not because I had purchased another computer!).
I use Windows on all of my computers. Using just Windows for bioinformatics is not impossible, but it's really just easier to have access to a Linux operating system. On my desktop PC I have a dual-boot setup (Ubuntu and Windows 8), and on my laptop, which came pre-installed with Windows 8, making it a pain to set up Ubuntu, I use VirtualBox to have access to Linux.
Each time I set up a new computer, I always install the list of programs below:
Putty: SSH client
Xming X Server: X Window System Server for Windows
WordWeb: handy dictionary program, which looks up any word you highlight
Launchy: a keystroke program, for quick access to programs
7zip: general purpose zip program
R for Windows
Opera: still one of my favourite web browsers and email client
Cygwin: Linux emulator on Windows
Dropbox: cloud file sharing program
VirtualBox: virtualisation software
My Linux distribution of choice is Ubuntu, and using VirtualBox you can run Ubuntu inside your Windows installation. Below I outline a list of must-have Linux bioinformatic tools for those working in the field of genomics and transcriptomics.
Download Ubuntu (I would recommend Ubuntu 12.04 LTS). After installing VirtualBox and Ubuntu, here's what I installed immediately:
#VirtualBox guest additions
sudo ./VBoxLinuxAdditions.run
#zlib (for bwa)
sudo apt-get install zlib*
#download bwa: http://sourceforge.net/projects/bio-bwa/files/
tar -xjf bwa-0.7.5a.tar.bz2
cd bwa-0.7.5a/
make
#install ncurses (for samtools)
sudo apt-get install ncurses-dev
#download SAMTools: http://sourceforge.net/projects/samtools/files/
tar -xjf samtools-0.1.19.tar.bz2
cd samtools-0.1.19/
make
#for BEDTools2
sudo apt-get install build-essential g++
git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make clean; make all
#FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/download.html
wget http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.1.tar.bz2
tar -xjf libgtextutils-0.6.1.tar.bz2
cd libgtextutils-0.6.1/
./configure
make
make check
sudo make install
wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2
tar -xjf fastx_toolkit-0.0.13.2.tar.bz2
cd fastx_toolkit-0.0.13.2/
./configure
make
sudo make install
#git
sudo apt-get install git-core
To share folders between VirtualBox and Windows, use Devices/Shared Folders. Afterwards, add your user to the vboxsf group:
sudo adduser `whoami` vboxsf
See also my post on installing R on Ubuntu.
#download your favourite genome
#hg19 for me
wget -O hg19.tar.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar -xzf hg19.tar.gz
#remove the random, unplaced and haplotype sequences
rm *random.fa
rm chrUn_gl0002*
rm *hap*.fa
#concatenate the remaining chromosomes, in natural sort order, into one file
for file in `ls *.fa | sort -k1V`; do echo $file; cat $file >> hg19.fa; done
rm chr*.fa
#download blat
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/blat/blat
Most personal computers available these days are powerful enough to run a virtualised operating system. In my humble opinion, if you're setting up Windows for bioinformatics, the easiest thing to do is to install VirtualBox and Ubuntu, and install the bioinformatic programs in that Ubuntu instance.
6th April 2012: For a more updated version of this post, please see this post.
A very simple comparison
I used the TagSeqExample.tab file from the DESeq package as the benchmark dataset. According to the DESeq authors, T1a and T1b are similar, so I removed the second column of the file, which corresponds to T1a:
cut --complement -f2 TagSeqExample.tab > tag_seq_example_less_t1a.tsv
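A toy demonstration of what --complement does (the file here is made up for illustration; the option is GNU cut only):

```shell
# Build a tiny tab-delimited file and drop its second column
printf 'gene\tT1a\tT1b\tT2\n' >  toy.tab
printf 'g1\t5\t7\t9\n'        >> toy.tab
cut --complement -f2 toy.tab
# gene  T1b  T2
# g1    7    9
```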
To get an idea of how similar the libraries are, I performed hierarchical clustering using the Spearman correlation of the libraries (see here). Note that these raw counts have not been adjusted for sequencing depth, but by using a rank-based correlation method we can hopefully bypass this necessity.
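The clustering step can be sketched as follows; I simulate counts here so the snippet runs standalone (the sample names and rnbinom parameters are made up, not the real libraries):

```r
set.seed(1)
# Simulated count matrix standing in for the five libraries
counts <- matrix(rnbinom(500, mu = 10, size = 1), ncol = 5)
colnames(counts) <- c("T1b", "T2", "T3", "N1", "N2")

# Spearman correlation works on ranks, so differences in sequencing
# depth between libraries have less influence on the distances
d  <- as.dist(1 - cor(counts, method = "spearman"))
hc <- hclust(d)
plot(hc)  # draws the dendrogram
```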
From the dendrogram, we observe a marked difference between the normal samples and the tumour samples.
Installing the R packages
I then installed all the required packages (R version 2.12.0, biocinstall version 2.7.4):
source("http://www.bioconductor.org/biocLite.R")
biocLite("baySeq")
biocLite("DESeq")
biocLite("edgeR")
DESeq code adapted from the vignette:
countsTable <- read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, stringsAsFactors=TRUE)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[,-1]
head(countsTable)
conds <- c("T","T","T","N","N")
cds <- newCountDataSet(countsTable, conds)
cds <- estimateSizeFactors(cds)
sizeFactors(cds)
cds <- estimateVarianceFunctions(cds)
res <- nbinomTest(cds, "T", "N")
resSig <- res[res$padj < 0.1, ]
nrow(resSig)
# 1072
write.csv(resSig, file="deseq_res_sig.csv")
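The padj column used for the cut-off is a Benjamini-Hochberg adjusted p-value; base R's p.adjust illustrates the adjustment on some made-up p-values:

```r
# Benjamini-Hochberg adjustment on toy p-values: each p-value is scaled
# by (number of tests / rank), then monotonicity is enforced
p <- c(0.001, 0.01, 0.02, 0.8)
p.adjust(p, method = "BH")
# approximately 0.004 0.020 0.027 0.800
```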
First transform the tag_seq_example_less_t1a.tsv file to the edgeR format using this script.
library(edgeR)
targets <- read.delim(file="targets.txt", stringsAsFactors=FALSE)
d <- readDGE(targets, comment.char="#")
d$samples
edgeR recently added a normalisation function called calcNormFactors(); for more information see this paper.
d <- calcNormFactors(d)
d$samples
This approach, the trimmed mean of M-values (TMM), is different from the one taken by DESeq, whose size factors are the median ratios of each library's counts to a pseudo-reference built from per-gene geometric means.
d <- estimateCommonDisp(d)
de.com <- exactTest(d)
options(digits=4)
topTags(de.com)
detags.com <- rownames(topTags(de.com)$table)
sum(de.com$table$p.value < 0.01)
# 895
sum(p.adjust(de.com$table$p.value, method="BH") < 0.1)
# 593
good <- sum(p.adjust(de.com$table$p.value, method="BH") < 0.1)
goodList <- topTags(de.com, n=good)
sink("edger.txt")
goodList
sink()
library(baySeq)
all <- read.delim("tag_seq_example_less_t1a.tsv", header=TRUE, sep="\t")
lib <- c(2399311,7202770,5856461,6376307,3931277)
replicates <- c(1,1,1,2,2)
groups <- list(NDE = c(1,1,1,1,1), DE = c(1,1,1,2,2))
cname <- all[,1]
all <- all[,-1]
all <- as.matrix(all)
CD <- new("countData", data=all, replicates=replicates, libsizes=as.integer(lib), groups=groups)
CD@annotation <- as.data.frame(cname)
cl <- NULL
CDP.NBML <- getPriors.NB(CD, samplesize=1000, estimation="QL", cl=cl)
CDPost.NBML <- getLikelihoods.NB(CDP.NBML, pET='BIC', cl=cl)
CDPost.NBML@estProps
# 0.8897621 0.1102379
topCounts(CDPost.NBML, group=2)
NBML.TPs <- getTPs(CDPost.NBML, group=2, TPs=1:100)
bayseq_de <- topCounts(CDPost.NBML, group=2, number=2000)
write.csv(bayseq_de, file="bayseq_de.csv")
|R packages compared|Number of differentially expressed genes in common|
|---|---|
|DESeq with edgeR|591|
|DESeq with baySeq|488|
|edgeR with baySeq|465|
|DESeq with edgeR with baySeq|338|
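The overlap counts above come from intersecting the gene identifiers reported by each package; a toy version of the calculation, with made-up identifiers standing in for each package's list:

```r
# Made-up gene identifiers standing in for each package's DE list
deseq_genes  <- c("g1", "g2", "g3", "g4")
edger_genes  <- c("g2", "g3", "g4", "g5")
bayseq_genes <- c("g3", "g4", "g5", "g6")

# pairwise overlap
length(intersect(deseq_genes, edger_genes))  # 3

# three-way overlap
length(Reduce(intersect, list(deseq_genes, edger_genes, bayseq_genes)))  # 2
```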
This is a first draft post and will be continually updated in the future, but I just thought I'd publish it first.