I will be giving a workshop titled "Reproducible Bioinformatics" at BioC Asia tomorrow. I have been thinking a lot about this topic and my aim for the workshop is to introduce computational tools and demonstrate how they can be used to help promote reproducibility when performing bioinformatic analyses. Ensuring reproducibility shouldn't be an extra burden but should be embedded in the way people work, especially when there are so many great tools and initiatives that have made it much easier to work in a reproducible manner.
## Rules for reproducible research
I'm a fan of the Ten Simple Rules series in PLOS Computational Biology. Ten Simple Rules for Reproducible Computational Research is a great article that provides very useful tips for working in a reproducible manner. Here are the ten rules:
* Rule 1: For Every Result, Keep Track of How It Was Produced
* Rule 2: Avoid Manual Data Manipulation Steps
* Rule 3: Archive the Exact Versions of All External Programs Used
* Rule 4: Version Control All Custom Scripts
* Rule 5: Record All Intermediate Results, When Possible in Standardised Formats
* Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
* Rule 7: Always Store Raw Data behind Plots
* Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
* Rule 9: Connect Textual Statements to Underlying Results
* Rule 10: Provide Public Access to Scripts, Runs, and Results
The workflowr R package helps researchers organise their analyses in a way that promotes effective project management, reproducibility, collaboration, and sharing of results. Workflowr combines literate programming (knitr and rmarkdown) and version control (Git, via git2r) to generate a website containing time-stamped, versioned, and documented results. Any R user can quickly and easily adopt workflowr.
If you follow the literate programming style using R Markdown for your R analyses, you should already be adhering to rules 1, 7, and 9, because your analysis notes and analysis code (including plotting code) are all in the same document(s). Workflowr takes care of rules 3 (for R package versions), 4, 6, and 10 with its framework. Rules 2, 5, and 8 depend on how you conduct your analyses and are good practices to follow.
Best Practices for Scientific Computing provides guidelines for developing software that overlap with the ideas behind rules 2, 5, and 8 from Ten Simple Rules for Reproducible Computational Research. Here are the best practices:
1. Write programs for people, not computers
2. Let the computer do the work
3. Make incremental changes
4. Don't repeat yourself (or others)
5. Plan for mistakes
6. Optimise software only after it works correctly
7. Document design and purpose, not mechanics
The literate programming paradigm is especially focused on making code and its logic more readable by humans. Letting the computer do the work, for example by writing scripts to automate tasks, also supports avoiding manual data manipulation steps. Making incremental changes shares ideals with generating hierarchical analysis output and with version control, as it makes code easier to track and troubleshoot. Practice 4, don't repeat yourself, includes modularising your code into functions or packages and not reinventing the wheel, i.e. not writing your own program when such a program already exists, which brings me to my next section.
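As a small, hypothetical example of letting the computer do the work (file names and the filtering threshold are invented for illustration): a step you might otherwise do by hand in a spreadsheet, such as removing low-count genes, can be captured as a script, so the exact manipulation is recorded and repeatable:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Hypothetical example of automating a manual data manipulation step:
# keep genes with a count of at least 10 from a raw counts table,
# instead of deleting rows by hand in a spreadsheet.

# Create a toy input so the example is self-contained.
cat > counts_raw.tsv <<'EOF'
gene	count
geneA	25
geneB	3
geneC	10
EOF

# The scripted step: keep the header (NR == 1) and rows passing the threshold.
awk -F'\t' 'NR == 1 || $2 >= 10' counts_raw.tsv > counts_filtered.tsv

cat counts_filtered.tsv
```

Because the filtering lives in a script rather than in mouse clicks, anyone (including your future self) can re-run it and get the same output from the same input.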
While R can be used for many different types of bioinformatic analyses, some bioinformatic tools still need to be installed outside of R. One tool that has simplified the installation and management of such packages/tools is Conda. From the Conda documentation:
> Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
Installing bioinformatic tools can be a challenge because of the various dependencies that may be missing from your system, which can spiral into dependency hell. Conda simplifies this process by handling library dependencies for you, and installing a tool is as easy as issuing one command in the terminal, whether on your own computer or on a high-performance computing cluster (HPCC). Most widely used bioinformatic tools are available through the Bioconda channel, which hosts over 6,000 bioinformatics packages.
Another feature of Conda is environments, which have two advantages:
1. It allows the use of multiple versions of the same package, which may be incompatible otherwise. Some bioinformatic tools require Python 2 and others require Python 3; using different Conda environments makes it easy to use both Python versions.
2. It allows the creation of isolated and distinct environments that can be shared with others. Conda environments can be exported to a file and used to recreate the exact environment you used for your analysis. This is useful for sharing your workspace so that others can reproduce your analysis.
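For example, an exported environment is just a YAML file that others can use to recreate your setup. A sketch (the environment name and package versions are invented for illustration; `conda env export` and `conda env create` are the standard commands for producing and consuming such a file):

```yaml
# environment.yml -- a hypothetical exported Conda environment.
# Produce one with:   conda env export > environment.yml
# Recreate it with:   conda env create -f environment.yml
name: rnaseq-analysis
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.8
  - samtools=1.9
  - salmon=0.14.1
```

Sharing this file alongside your analysis lets collaborators install the exact same tool versions with a single command, which also helps with rule 3 above.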
Another tool that can be used to create an isolated and shareable environment is Docker, though it achieves this in a different manner from Conda. Docker is an open source project that lets you package, ship, and run any application as a lightweight container. From the official documentation:
> A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.
Docker and Conda can both be used to ease the installation of bioinformatic tools and to create isolated environments. So why learn about both tools? I have not tried building a Conda package, but to me it seems easier to package a tool using Docker. You can use Docker to run an instance of your favourite Linux distribution, akin to running a virtual machine, and work as if you were in a Linux environment. On a side note, if you pull the official Ubuntu image (or some other distro) from Docker Hub, the image will contain only a minimal set of libraries and tools. But since you have admin privileges when running Linux using Docker, you can simply install the missing libraries and tools yourself. One potential downside is that your HPCC may not have Docker (or Singularity) installed, so in the end you may have to resort to Conda.
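As a sketch of what this looks like in practice (the base image is the real official Ubuntu image; the installed packages are chosen purely for illustration), a Dockerfile starts from a minimal distribution image and adds the libraries and tools your analysis needs:

```dockerfile
# Hypothetical Dockerfile: start from the minimal official Ubuntu image
# and install the tools an analysis needs.
FROM ubuntu:18.04

RUN apt-get update && apt-get install -y \
        build-essential \
        samtools \
    && rm -rf /var/lib/apt/lists/*

# Default command when the container runs.
CMD ["samtools", "--version"]
```

Building this with `docker build` produces an image that bundles the tools and their system libraries, so the same environment can be run on any machine with Docker installed.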
The BioContainers initiative has packaged many bioinformatic tools that can simply be pulled from its Docker Hub account. The Rocker Project has built various R images with pre-installed packages and libraries, making it easy to share your R environment. I have written about how you can run RStudio Server from Docker, which is another avenue for sharing your exact working environment in R.
Bioinformatics is a multidisciplinary field where its practitioners come from various disciplines and have different skill sets. Many people, including myself, have no formal training in computer science and are mostly self-taught; as a consequence, they may not be aware of best practices when it comes to coding or performing an analysis. My goal for the upcoming Reproducible Bioinformatics workshop is to inform people of some best practices and demonstrate how the use of some computational tools can help promote reproducibility.
I believe that everyone should invest time in learning about reproducible research, and as soon as possible. Datasets are getting larger and analysis workflows are getting more complex; it is easy to drown in data if good organisational practices are not followed. Furthermore, you will most likely be analysing your own dataset, because not everyone has access to a bioinformatician. This should be motivating: you are probably the best person to analyse your own dataset, since you have the domain-specific knowledge that helps with interpreting the data.
I'm still working on the workshop material, which will be available in this GitHub repository.
This work is licensed under a Creative Commons Attribution 4.0 International License.