Setting up Windows for bioinformatics in 2019

This is an update on my original post Setting up Windows for bioinformatics that I wrote in 2011. I had switched over to the Mac operating system (Mac OS X) for work when my HP laptop was replaced with a MacBook Air sometime in 2012. A few years later, I wiped my Windows installation from my home desktop computer and replaced it with Ubuntu. Since then, my interaction with the Windows operating system has been minimal. Recently, I re-installed Windows and have been quite impressed with its usability. This post is on how I set up my Windows desktop for bioinformatics in 2019.

One of the main reasons for installing Windows 10 was that I wanted to check out the Ubuntu integration with Windows. I remember reading about it a while ago when it was first introduced and thought this was a smart move by Microsoft; this is very handy for people who still want to use Windows and run bioinformatic tools, which are usually only available on a Linux system. For desktop and laptop computers, Windows is still the dominant operating system. The numbers for Windows are a bit lower for visitors to my blog; for visitors in the last 30 days, 52.9% use Windows, 33.4% use Mac, and 9% use Linux. Hence I thought this post may be of interest to many of my visitors.

Installing Ubuntu on Windows 10 is extremely easy; all you have to do is go to the Microsoft Store and look for Ubuntu. You can follow this guide if you need a bit more information.

Once installed you can start Ubuntu from the search bar. If you highlight text from the Ubuntu window, and right-click, the text is saved to the clipboard.

# the version of Ubuntu
cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

# all my cores are recognised
cat /proc/cpuinfo | grep processor
processor       : 0
processor       : 1
processor       : 2
processor       : 3
processor       : 4
processor       : 5
processor       : 6
processor       : 7
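
If only the number of cores is needed, for example to pick a thread count for a tool, the listing above can be reduced to a single number. This is a minimal sketch; the variable name n_cores is an arbitrary choice and the nproc fallback assumes GNU coreutils is available.

```shell
# count the logical processors listed in /proc/cpuinfo
n_cores=$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || true)

# fall back to nproc if the file is missing or lists no processors
if [ -z "${n_cores}" ] || [ "${n_cores}" -eq 0 ]; then
    n_cores=$(nproc)
fi

echo "${n_cores}"
```

The value can then be passed to tools that accept a thread count.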


The Windows filesystem is automatically mounted, so you can access all your Windows files on Ubuntu.

df -h
Filesystem      Size  Used Avail Use% Mounted on
rootfs          923G  224G  699G  25% /
none            923G  224G  699G  25% /dev
none            923G  224G  699G  25% /run
none            923G  224G  699G  25% /run/lock
none            923G  224G  699G  25% /run/shm
none            923G  224G  699G  25% /run/user
C:              923G  224G  699G  25% /mnt/c
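
As the last line shows, the C: drive is mounted at /mnt/c, so a Windows path maps onto a path under /mnt with a lower-case drive letter. The function below is a small sketch of that mapping; win_to_wsl is a hypothetical helper written for illustration, and recent WSL releases also ship a wslpath utility that handles the conversion for you.

```shell
# convert a Windows path such as C:\Users\Dave\data to its /mnt equivalent
win_to_wsl() {
    local drive rest
    drive="${1%%:*}"     # text before the first colon, e.g. C
    rest="${1#*:}"       # text after the first colon, e.g. \Users\Dave\data
    drive=$(printf '%s' "${drive}" | tr '[:upper:]' '[:lower:]')
    rest=$(printf '%s' "${rest}" | tr '\\' '/')
    printf '/mnt/%s%s\n' "${drive}" "${rest}"
}

win_to_wsl 'C:\Users\Dave\data'
# prints /mnt/c/Users/Dave/data
```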


You have read/write permissions to the Windows mount in Ubuntu, so you can write output files to the mount. If you want to have a file shared across Windows and Ubuntu and need to preserve its file permissions in Ubuntu (such as for an SSH private key), you will need to edit or create a wsl.conf file inside /etc.

sudo vi /etc/wsl.conf


Add these two lines inside the conf file.

[automount]
options = "metadata"

Finally, restart WSL:

1. Press the Win Key + R, which will bring up a Windows Run box
2. Enter services.msc and hit enter
3. Look for LxssManager and restart this service

The next time you start Ubuntu, file permissions should be persistent.
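
One way to check this is to restrict a file and read its mode back. The sketch below assumes GNU stat (the default on Ubuntu); TARGET_DIR and the file name are placeholders, and the script defaults to a temporary directory, so point TARGET_DIR at a directory under /mnt/c to test the Windows mount itself.

```shell
# directory to test; set TARGET_DIR to a path under /mnt/c to test the Windows mount
target_dir=${TARGET_DIR:-$(mktemp -d)}
test_file="${target_dir}/perm_test.txt"

touch "${test_file}"

# restrict the file to its owner, as SSH requires for a private key
chmod 600 "${test_file}"

# read the mode back; if permissions are preserved this is 600
perms=$(stat -c '%a' "${test_file}")
echo "${perms}"

rm -f "${test_file}"
```

If the reported mode on the Windows mount is not 600, the wsl.conf change has probably not taken effect yet.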

If you wanted to do the opposite and access files from Ubuntu on Windows, you can navigate to this location (replacing Dave, unless you are also Dave):

C:\Users\Dave\AppData\Local\Packages\CanonicalGroupLimited.UbuntuonWindows_79rhkp1fndgsc\LocalState\rootfs

The string "79rhkp1fndgsc" may be different on your computer; I'm not sure. I've set up a shortcut to this location, so I can easily access Ubuntu files from my Windows File Explorer.

Since this is a fresh installation of Ubuntu it does not come with many utilities and bioinformatic tools; I even had to install unzip.

sudo apt install unzip


For installing bioinformatic tools, I recommend using Conda, which is a packaging tool and installer. Many of the popular bioinformatic tools can be installed using Conda. If you want more information, I have some notes on Conda on my Wiki site. I recommend using Miniconda over Anaconda, since Anaconda comes prepackaged with too many packages. To install Miniconda on Ubuntu, simply download a shell script, run it, and follow the instructions.

wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
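
For an unattended setup, the installer can also be run in batch mode with -b and given an install location with -p. The function below is a sketch under those assumptions; install_miniconda and the default prefix are hypothetical names chosen for illustration, and the function is only defined here, not called.

```shell
# download and install Miniconda without interactive prompts
# usage: install_miniconda [PREFIX]
install_miniconda() {
    prefix=${1:-"${HOME}/miniconda3"}
    installer=Miniconda3-latest-Linux-x86_64.sh
    wget -c "https://repo.anaconda.com/miniconda/${installer}" &&
        bash "${installer}" -b -p "${prefix}" &&
        echo "installed to ${prefix}"
}

# call the function explicitly to install, for example:
# install_miniconda "${HOME}/miniconda3"
```

In batch mode the installer does not edit your shell start-up files, so you would still need to add the bin directory under the prefix to your PATH or run conda init afterwards.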


Installing BWA is a breeze and you don't have to worry about compiling the program yourself; all dependencies are taken care of.

conda install -c bioconda bwa

bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.


Another cool feature of Conda is environments, which are isolated workspaces. Imagine buying a brand new laptop that has nothing installed. You gradually install the tools that you use for your analyses and save all your analysis code onto said laptop. After you have finished, you can give that laptop to someone else and they will have the exact environment you used to carry out your work. Conda environments are similar, except that you don't have to give your computer away. You can simply save the environment that you were working in and share it. Environments are also handy for providing a list of packages that need to be installed for a particular analysis.
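
To make this concrete, an environment is described by a small YAML file. The sketch below writes a hypothetical one to a temporary directory; the environment name, channels, and package list are assumptions for illustration and not the contents of any particular repository.

```shell
# write a hypothetical environment file (names and packages are illustrative only)
env_dir=$(mktemp -d)
env_file="${env_dir}/environment.yml"

cat > "${env_file}" <<'EOF'
name: my_analysis
channels:
  - conda-forge
  - bioconda
dependencies:
  - bwa
  - samtools
EOF

cat "${env_file}"

# with Conda installed, the environment could then be created with:
#   conda env create -f "${env_file}"
# and an activated environment can be saved to a file with:
#   conda env export > environment.yml
```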

For example, I have created an environment file in my learning VCF GitHub repository. If you clone this repository and use Conda to create the environment, you should be able to carry out the same analysis. I had performed the analysis on my MacBook Pro and was able to replicate the results on my Windows 10 machine using Ubuntu.

# create a copy of the code repository
git clone https://github.com/davetang/learning_vcf_file.git

# change directories
cd learning_vcf_file

# create an environment and install the necessary programs
# this may take some time depending on your internet speed
conda env create -f environment.yml

# activate the environment
source activate learning_vcf

# change into the analysis directory
cd analysis

# you will need to download GATK if you want to call variants using GATK
# this step may take some time depending on your internet speed
unzip gatk-4.1.1.0.zip

# the analysis may take some time depending on your computer
./run.sh


The run.sh script generates some random sequences and calls variants using BCFtools, FreeBayes, and GATK.
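
As an illustration of the first of those steps, a random DNA sequence can be generated with nothing more than awk. The sketch below is not the code from the repository; random_fasta, its arguments, and the seed are hypothetical and chosen only for this example.

```shell
# print a FASTA record containing a random DNA sequence
# usage: random_fasta NAME LENGTH SEED
random_fasta() {
    awk -v name="$1" -v len="$2" -v seed="$3" 'BEGIN {
        srand(seed)                  # a fixed seed repeats the sequence on the same awk
        split("A C G T", base, " ")  # base[1] to base[4] hold the four nucleotides
        printf ">%s\n", name
        for (i = 0; i < len; i++) {
            printf "%s", base[int(rand() * 4) + 1]
        }
        printf "\n"
    }'
}

random_fasta test_seq 60 31
```

Redirecting the output to a file gives a small reference sequence to experiment with.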

Summary

It is definitely much easier to set up a Windows 10 computer for bioinformatic analyses in 2019! The main reason is the native support of Ubuntu by Windows. Previously I recommended using VirtualBox, which worked, but was not as straightforward to install and use. With the advent of Conda, installing and managing bioinformatic tools has also become much easier. Another advantage of using Conda is the environment support. I showed an example where I created an environment that I used to call variants from some randomly generated sequences.

One more thing; if you use R, I would highly recommend using RStudio and keeping track of your analyses using R Markdown. Even if you don't use R, I still recommend using RStudio and R Markdown; I regularly use R Markdown in RStudio to write down technical notes that have nothing to do with R. The Vim keybinding support by RStudio is one major reason.

As a final note, my desktop computer is around 8 years old now and Windows 10 works surprisingly well. My computer starts up in less than a minute; if you have used older versions of Windows, you will know that start-up speeds were notoriously bad. Despite mainly using my MacBook Pro for work these days, I can see myself going back to Windows!

Conclusion: use Ubuntu, use Conda, use RStudio, and use R Markdown.

1. Jyoti says:

Hi Dave,
Amazing post as always! I have a slightly different question. I have been dual-booting my laptops and workstations, with most of my bioinformatics performed on Ubuntu and Microsoft Office related stuff on Windows. Since Microsoft Office is available on macOS, is it wise to just upgrade to a MacBook? How would you compare Linux vs macOS for bioinformatics?

Thanks,
Jyoti

1. Davo says:

Hi Jyoti,

Personally, I use a MacBook Pro (15-inch, 2017) because I like the trackpad a lot and the battery life is good; hence I’m using macOS. For any work that requires heavy computing I have to use servers, compute clusters, etc., which all use some flavour of Linux. For macOS or Linux, I primarily use Conda to install all my bioinformatic tools.

One major downside of the MacBook Pro is the price. You can get laptops with similar or better specs for less.

Cheers,
Dave

2. Linhe Xu says:

Great posts! I also found the new Windows Terminal (preview) very useful. You can find it in the Microsoft Store for free. I just started trying it, and feel it is even better than the Mac terminal.

3. K says:

Hi Dave,

This was a very useful post and gave me the capability to do bioinformatics analysis on my Windows computer.

Thank you very much!!!

Sincerely,
K

4. Anthony says:

Hi,
thanks for the article ^^

I am wondering if the performance under “Linux from Windows” is similar to “pure/only Linux PC”?
I will soon lose my access to a cluster… so I will need to perform my (basic) RNAseq/Variant Calling analysis on a future laptop, but I am still in the thinking mode “Windows -with Linux” or “MacOS”. As you said, for the same price I can obtain much better under Windows…

thx 🙂

1. Davo says:

Hi Anthony,

I don’t have dual booting on my PC, so I can’t test the performance of using Linux with Windows Subsystem for Linux (WSL2) versus installing Linux natively. I used to dual boot when WSL didn’t exist but not anymore because WSL is much easier. Most things work using WSL but sometimes I come across strange errors. If you do not need to use Windows, you are probably better off installing Linux natively.

I don’t know how long your RNA-seq and variant calling analyses run for, but it probably isn’t a good idea to run long workflows on a laptop since most laptops don’t have very good cooling. I’d suggest you get a desktop PC instead, but if purchasing is not up to you, then remember to monitor the temperature of different components on your future laptop.

Hope that helps,
Dave

1. Anthony says: