Reading the early release of “Bioinformatics Data Skills”

2014 August 11th: I noticed that some people are coming across this post by googling "bioinformatics data skills pdf". IMHO, this book is well worth the 40 dollars (price of the ebook) and I'm sure the author went through a lot of hard work to get it published, so please compensate him for his work.

This is the first book commentary I've written that isn't for a school assignment (and also the first on this blog). I felt compelled to write it because I could relate to many of the issues highlighted in the early release version of the "Bioinformatics Data Skills" book and I wanted to share my thoughts. In school, I wasn't well suited to review (or deeply appreciate) classics such as "Lord of the Flies" or "Brave New World", but I could offer my interpretation based on what I gathered. Similarly, I'm not a bioinformatics expert, but I can provide my thoughts on the book, for what it's worth.

In one sentence, the book is about developing good bioinformatics practices by understanding computational principles, so that ultimately our work can be reproduced by ourselves and, more importantly, by others. Recently there has been a stream of comic strips over at PhD comics highlighting problems that occur when non-programmers (i.e. people without formal programming training) start to program; this particular strip is all too familiar. While we may look at the strip and laugh, poor documentation is a serious problem for reproducibility and, to use one of the quotes shared in the book:

"...non-reproducible single occurrences are of no significance to science"
--Karl Popper in The Logic of Scientific Discovery

Now for some background to help explain why I could relate so deeply to the lessons taught in the book. During my undergraduate years, I studied mainly biochemistry and microbiology subjects and only switched to bioinformatics during my Honours degree in Australia. It was a one-year degree, so I had to learn bioinformatics fast in order to finish on time. I learned by trial and error, and by consulting Perl/Unix cookbooks or those "learn Perl/Unix in n hours" books. Basically, I learned the wrong way; what I needed was to develop the right mentality, but this only started to develop during my PhD years. Here's one example from the book that really hit the spot for me; I will paraphrase it:

There's a point in every nascent programmer's career when they feel compelled to write their own code for a task, instead of using an existing library or package.

A while ago, I would think to myself, why complicate a script by depending on libraries? I could write a script that's entirely self-contained, with my own code and functions. But of course, this would introduce my own idiosyncratic bugs and would not leverage the collective work of others, who have almost definitely written much better code that is probably being actively maintained. Reinventing the wheel is frowned upon (especially when the new wheel is worse), and rightly so.

There are other mentalities that weren't ingrained in me, such as using structured vocabularies, structuring a file directory, or even something as simple as naming files and folders. These things don't seem particularly important, but they make a big difference, and the book offers recommendations and explains why they're important. Learning how to be logical and structured makes the workflow much more robust, which means higher reproducibility.

Importantly, the book also teaches you how to use the tools that can help you achieve robust and reproducible research, including Markdown, Git, Unix and some of its core utilities. There are also tips and advice sprinkled throughout the book, which come from the author's own personal experience. Here's a post on Biostar that's unrelated to the book but lists some common bioinformatics mistakes that the book also mentions (such as accidentally overwriting your FASTA file when using grep for the pattern >, or off-by-one errors). The relevant sections of the book helped patch up some holes in my own understanding (for example, I had no idea why the syntax for redirecting STDERR is 2>; I had just memorised it).
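To make the grep and 2> points concrete, here is a minimal shell sketch. It is an illustration rather than an example taken from the book, and the file names (example.fa, copy.fa, out.log, err.log) and the missing path are hypothetical. It runs in a temporary directory so that nothing real is overwritten.

```shell
# Work in a throwaway directory so no real data can be overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# A tiny, made-up FASTA file with two records (hypothetical data).
printf '>seq1\nACGT\n>seq2\nGGCC\n' > example.fa
cp example.fa copy.fa

# Quoted: '>' reaches grep as the pattern, so this counts the header lines.
header_count=$(grep -c '>' example.fa)
echo "headers in example.fa: $header_count"

# Unquoted: the shell treats '>' as "redirect stdout to copy.fa" and
# truncates the file before grep runs. grep then has no pattern, complains
# on stderr (silenced here) and fails; '|| true' keeps the script going.
grep > copy.fa 2> /dev/null || true

if [ -s copy.fa ]; then
    echo "copy.fa still has content"
else
    echo "copy.fa is now empty"
fi

# Why 2>: every process starts with three numbered file descriptors:
# 0 = stdin, 1 = stdout, 2 = stderr. A bare '>' is shorthand for '1>'.
# ls is expected to fail on the missing path, hence '|| true'.
ls example.fa ./no-such-file > out.log 2> err.log || true
echo "stdout captured: $(cat out.log)"
```

The quoted grep reports two headers, while the unquoted one leaves copy.fa empty. In the last command, the name of the file that exists goes to out.log through descriptor 1 and the complaint about the missing path goes to err.log through descriptor 2.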

The timing of the book is perfect; there is just so much data out there and we need to learn how to deal with such large quantities of it. I wish I had read this book 8-9 years ago, when I first started out on my own bioinformatics project, so that I wouldn't have had to learn things the hard way. Personally, I would highly recommend it to anyone who is taking the path from biology to bioinformatics or just starting out in bioinformatics. Of course, reading the book isn't going to give you all the necessary bioinformatics data skills, but it's a great start. As with all things, practice makes perfect and bioinformatics is no exception; but having the right mentality to go with all that practice is equally important.

Lastly, the book is still being written and, if I may suggest, here are some data skills that I think are essential to the modern-day bioinformatician: "How to use a job scheduler (PBS, OGE) to submit jobs, etc." and "How to parallelise your work (perhaps a small section on GNU parallel?)".

Looking forward to reading the rest of the book!

Recommended reading

Ten Simple Rules for Reproducible Computational Research
A Quick Guide to Organizing Computational Biology Projects
Communicating with a lay audience about scientific subjects
Collection of published “guides” for bioinformaticians



This work is licensed under a Creative Commons Attribution 4.0 International License.
