Ten years

As of today, it has been a decade since my first post on this blog. It started aimlessly during my PhD as a place to post analysis notes for myself, and ten years later it remains so. However, over the years I have started to focus more on better project management practices and on the reproducibility of my analyses. Below is a list of things that I now commonly practise.

For my projects and subsequent analyses, I use the workflowr framework to organise my work. Using this framework, all my analyses are version controlled using git and can be easily uploaded to GitHub. For some projects, I also use the Git Large File Storage (LFS) service to conveniently store large files in the same repository. However, as far as I'm aware, Git LFS isn't free.

I use RStudio via a Docker container that is running RStudio Server. By mounting my workflowr directory to the Docker container, I can have the same working environment across different computing environments.

For bioinformatic tools and packages that are not R packages, I check whether the tool is packaged on Bioconda so that I can simply install it using Conda. This is much simpler than trying to compile the tools yourself. I tend to keep separate Conda environments for different tools to avoid package conflicts and long environment solve times. If I have to compile a tool, I use Docker and this base image, which has a lot of typically required libraries pre-installed, for compiling.

I gave a workshop at BioC Asia last year specifically on Docker, Conda, and workflowr if you are interested in learning more about these tools.

In addition to Conda, I have also started using Environment Modules to control my working environment. It is quite easy to install and modulefiles are straightforward to write.

I use a workflow management system for my bioinformatics workflows. Specifically, I use WDL and Cromwell. I wrote a bit more about workflow management systems here.

My mentality nowadays is to script as much as possible and to write generalised scripts that can be easily reused. One way to make scripts reusable is to have them accept command line arguments. R Markdown files can accept parameters too, if you write them as parameterised R Markdown files. I have an example in my learning_vcf_file repository.
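As a minimal sketch of a reusable script that accepts command line arguments, here is a Python example using the standard argparse module. It is only an illustration: the script name count_vcf.py, the --min-qual option, and the function names are hypothetical and are not taken from the learning_vcf_file repository.

```python
#!/usr/bin/env python3
"""Count VCF records per chromosome (illustrative sketch only)."""
import argparse
import sys
from collections import Counter


def build_parser():
    # All option names here are hypothetical, chosen for illustration.
    parser = argparse.ArgumentParser(
        description="Count VCF records per chromosome.")
    parser.add_argument("vcf", help="path to an uncompressed VCF file")
    parser.add_argument("--min-qual", type=float, default=0.0,
                        help="skip records with QUAL below this value")
    return parser


def count_by_chrom(lines, min_qual=0.0):
    """Return a Counter mapping chromosome name to number of records."""
    counts = Counter()
    for line in lines:
        # Skip header lines (starting with '#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        # A missing QUAL is written as "."; such records are kept.
        if qual != "." and float(qual) < min_qual:
            continue
        counts[fields[0]] += 1
    return counts


def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    if not argv:
        # With no arguments, show the help text instead of failing.
        parser.print_help(sys.stderr)
        return
    args = parser.parse_args(argv)
    with open(args.vcf) as handle:
        counts = count_by_chrom(handle, args.min_qual)
    for chrom, n in sorted(counts.items()):
        print(f"{chrom}\t{n}")


if __name__ == "__main__":
    main()
```

Saved as count_vcf.py, it could be run as `python count_vcf.py sample.vcf --min-qual 30`, where sample.vcf is a placeholder file name. Keeping the counting logic in its own function, separate from the argument parsing, means the same code can also be imported and reused from another script.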

In the past I would try to implement a solution myself; now I spend more time looking for a tool that already does what I want, i.e. I try not to re-invent the wheel. (Quite often this tool is bedtools because it can do everything.) In the same vein of being more pragmatic, I now tend to get a tool or workflow working first and focus on fine tuning and sanity checking afterwards because, as you may or may not know, it is sometimes quite difficult just to get a tool working.

I spend a bit more time on creating nicer visualisations using ggplot2 because figures typically convey a message much more clearly than text and are more interesting to look at.

Bioinformatics and computing in general are changing rapidly. I'm sure all the tools I am currently using will be succeeded by bigger, faster, and shinier replacements. But I'll probably still be using Perl 🙂

This work is licensed under a Creative Commons Attribution 4.0 International License.
6 comments
  1. Thanks for the tips for making reproducible analyses.
    A Perl a day makes a happy Dave 🙂

    Best
    Lifeng

  2. Thanks Dave. Your blog is super valuable to a lot of people so thanks for the last 10 years and hoping it continues for more!
