Setting up Windows for bioinformatics

Updated 2014 May 20th: I created this post on the 19th of August 2011, when I had just purchased yet another computer (Core i7 2600, 8 GB of RAM and a GTX460) and was setting it up for bioinformatics. I updated this page around mid-2013 when I bought another laptop (Asus N56V, Core i7 3630QM with 12 GB of RAM) and was setting it up for work. I'm updating it again because I'm seeing an increase in traffic to this page (and thankfully not because I purchased another computer!).

I use Windows on all of my computers. Using just Windows for bioinformatics is not impossible, but it's really just easier to have access to a Linux operating system. My desktop PC has a dual boot setup (Ubuntu and Windows 8). My laptop came pre-installed with Windows 8, which makes it a pain to set up Ubuntu, so there I use VirtualBox to have access to Linux.

Each time I set up a new computer, I always install the list of programs below:

Putty: SSH client
Xming X Server: X Window System Server for Windows
WordWeb: handy dictionary program, which looks up any word you highlight
Launchy: a keystroke program, for quick access to programs
7zip: general purpose zip program
R for Windows
RStudio
Opera: still one of my favourite web browsers and email client
Avast: antivirus
ActivePerl
Cygwin: Linux emulator on Windows
Dropbox: cloud file sharing program
VirtualBox: virtualisation software

My Linux distribution of choice is Ubuntu, and with VirtualBox you can have Ubuntu installed inside your Windows installation. Below is a list of must-have Linux bioinformatics tools for those working in genomics and transcriptomics.

Ubuntu

Download Ubuntu (I would recommend Ubuntu 12.04 LTS). After installing VirtualBox and Ubuntu, here's what I installed immediately:

#VirtualBox guest additions:
sudo ./VBoxLinuxAdditions.run

#zlib (for bwa)
sudo apt-get install zlib1g-dev

#download bwa: http://sourceforge.net/projects/bio-bwa/files/
tar -xjf bwa-0.7.5a.tar.bz2
cd bwa-0.7.5a/
make

#install ncurses (for samtools)
sudo apt-get install ncurses-dev

#download SAMTools: http://sourceforge.net/projects/samtools/files/
tar -xjf samtools-0.1.19.tar.bz2
cd samtools-0.1.19/
make

#for BEDTools2 (git clone needs git, which is installed in the git step further down)
sudo apt-get install build-essential g++
git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make clean; make all

#FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/download.html
wget http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.1.tar.bz2
tar -xjf libgtextutils-0.6.1.tar.bz2 
cd libgtextutils-0.6.1/
./configure
make
make check
sudo make install

wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2
tar -xjf fastx_toolkit-0.0.13.2.tar.bz2
cd fastx_toolkit-0.0.13.2/
./configure
make
sudo make install

#git
sudo apt-get install git-core
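
To see which of these tools the shell can actually find afterwards, a small helper along these lines might do. It's only a sketch: check_tools is a name I made up. Note that the bwa, samtools and bedtools steps above only compile in the source directory, so those will show up as missing unless you copy the binaries somewhere on your PATH.

```shell
# check_tools (hypothetical helper): report which commands are on the PATH
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  # exit status is 0 only if every tool was found
  return "$missing"
}

# example: check_tools bwa samtools bedtools fastx_trimmer git
```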

For sharing folders between VirtualBox and Windows, click on Devices/Shared Folders. Afterwards, add your user to the vboxsf group:

sudo adduser `whoami` vboxsf
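
The new group membership only applies to sessions started after the change, so log out and back in (or reboot the VM) before trying the shared folder. To check whether the current session already has the group, something like this sketch works. has_group is a hypothetical helper name, not part of VirtualBox.

```shell
# has_group (hypothetical helper): succeed if GROUP appears in a
# space-separated LIST of group names
has_group() {
  group=$1
  list=$2
  case " $list " in
    *" $group "*) return 0 ;;
    *) return 1 ;;
  esac
}

# id -nG prints the group names of the current session
if has_group vboxsf "$(id -nG)"; then
  echo "vboxsf is active in this session"
else
  echo "vboxsf is not active yet; log out and back in"
fi
```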

See also my post on installing R on Ubuntu.

Others

#download your favourite genome
#hg19 for me
wget -O hg19.tar.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
#unpack the per-chromosome fasta files
tar -xzf hg19.tar.gz
rm *random.fa
rm chrUn_gl0002*
rm *hap*.fa
for file in `ls *.fa | sort -k1V`; do echo $file; cat $file >> hg19.fa; done
rm chr*.fa

#download blat
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/blat/blat
#the downloaded binary isn't executable yet
chmod +x blat
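
A note on the genome step above: sort -V (GNU version sort) is what puts chr2 before chr10, where a plain sort would give chr1, chr10, chr2. Here's a rough sketch of the same concatenation as a function, which also avoids parsing ls. concat_fasta is a hypothetical name, and it assumes GNU sort and file names without newlines.

```shell
# concat_fasta OUT FILE... (hypothetical helper): write the given
# FASTA files to OUT in version-sort order (chr1, chr2, ..., chr10, ...)
concat_fasta() {
  out=$1
  shift
  : > "$out"   # start from an empty output file
  printf '%s\n' "$@" | sort -V | while IFS= read -r f; do
    cat "$f" >> "$out"
  done
}

# example: concat_fasta hg19.fa chr*.fa
# then list the sequence names in order: grep '^>' hg19.fa
```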

Conclusions

Most personal computers available these days are powerful enough to run a virtualised installation of an operating system. In my humble opinion, if you're setting up Windows for bioinformatics, the easiest thing to do is to install VirtualBox and Ubuntu, and then install the bioinformatics programs in that Ubuntu instance.

Making a line chart with non-numerical x axis

A basic example of creating a line chart with user-defined x axis values in R.

opar=par(ps=18)
label = c('no_filter',9,8,7,6,5,4)
#the values are numbers, so they're not quoted
a <- c(0.4682953,0.466284,0.4587435,0.4095376,0.4444738,0.7144069,1.105043)
b <- c(0.9562088,0.953856,0.9104818,0.7554028,0.64136,0.877509,1.125698)
c <- c(0.7536005,0.7487367,0.7200604,0.6408311,0.5488365,0.6355055,1.051849)
d <- c(0.6601285,0.6566467,0.623516,0.5532256,0.5434039,0.6835916,1.047395)
e <- c(0.7536913,0.7511848,0.7338917,0.6548796,0.5129727,0.6585963,0.9883826)
f <- c(0.5596907,0.5595791,0.5512355,0.5178115,0.5014316,0.5900139,0.9123776)
g <- c(0.4868574,0.4866527,0.4776274,0.4359562,0.3950309,0.5714427,1.190739)
#set ylim from all seven series, otherwise the plot is scaled to a only and the other lines get clipped
plot(a,axes=F,xlab="",ylab="",type="b",col="red",ylim=range(a,b,c,d,e,f,g))
lines(b,type="b",col="orange")
lines(c,type="b",col="yellow")
lines(d,type="b",col="green")
lines(e,type="b",col="blue")
lines(f,type="b",col="purple")
lines(g,type="b",col="violet")
axis(2)
axis(1,at=1:length(label),labels=label)
title(main = "main", xlab="xlab", ylab = "ylab")
legend(4,1.1,c("a","b","c","d","e","f","g"),col=c("red","orange","yellow","green","blue","purple","violet"),lty=c(1,1,1,1,1,1,1),lwd=c(1,1,1,1,1,1,1))

opar=par(ps=18)
label = c('no_filter',9,8,7,6,5)
data = read.table("file.tsv",header=F,sep="\t")
data = data[,-1]
range(as.vector(t(data))) #get the range
#yrange is never drawn (type="n"); it only sets up a plot region with x from 1 to 6 and y from 0.2 to 0.7
yrange = c(0.2,0.2,0.2,0.2,0.2,0.7)
plot(yrange,type="n",axes=F,ylab="",xlab="")
#draw rows 1 to 14 of data, one line per row
for (r in 1:14) {
  lines(as.vector(t(data[r,])),type="b")
}
axis(2)
axis(1,at=1:length(label),labels=label)
title(main = "main", xlab="xlab", ylab = "ylab")