Deciding which bioinformatics tool to use

I just finished reading "Using prototyping to choose a bioinformatics workflow management system", which I summarised on Mastodon as follows:

Enjoyed reading "Using prototyping to choose a #bioinformatics workflow management system". Paper describes authors' 10 day experience searching and implementing a workflow. Summary: Need to decide which tool to use? Shortlist a list of potentially useful tools based on your needs. Start using each tool on a simpler problem. Assess the suitability of each tool. Paper contains useful tips for building reproducible workflows and links to many useful resources.

My summary doesn't really do it justice, and I recommend reading the paper even if you don't build bioinformatics workflows/pipelines; you'll appreciate the steps necessary for creating reproducible workflows and will be better for it. For this post I'd like to talk about their shortlisting criteria, which I think are very useful in general when deciding which bioinformatics tool to use.

The first step is to determine the specific requirements of your analysis/project. You can't look for what you want, if you don't know what you need. Ask what tools people recommend on https://www.biostars.org/ (check out Ten Simple Rules for Getting Help from Online Scientific Communities), search through https://bio.tools/, and use an Internet search engine to look for available software that may suit your needs.

The following are the selection criteria from the paper; use them to narrow your candidates down to a shortlist of tools to trial.

| Criterion | Description |
| --- | --- |
| Popularity | The system seems to be in common use and is well regarded within the bioinformatics community. The system is likely to be practically usable. |
| Free and open-source licence | The system is free and has an open-source licence. |
| Well established, stable, and with a future | The system has been around for at least a year, has regular releases, and there is evidence that it is actively maintained, developed, and supported. Development of the system is unlikely to stop after we migrate to it. |
| Text-based workflow development environment | The system uses a command-line, text editor-based workflow development environment. |
| Ease of initial use | Ease of download, install, and initial use of each system. |
| Quality of supporting documentation | Readability and utility of documentation and tutorials. |
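One way to apply criteria like these is a simple scoring matrix. The following is a minimal sketch of that idea and not the method used in the paper: the tool names, the 0 to 2 scoring scale, and the equal weights are all hypothetical placeholders that you would replace with your own judgement.

```python
from __future__ import annotations

# Criterion names loosely follow the table above.
# The weights are an assumption (all equal here); adjust them to your needs.
CRITERIA = {
    "popularity": 1.0,
    "open_source": 1.0,
    "established": 1.0,
    "text_based": 1.0,
    "ease_of_use": 1.0,
    "documentation": 1.0,
}


def weighted_score(scores: dict[str, int], weights: dict[str, float] = CRITERIA) -> float:
    """Return the weighted sum of per-criterion scores (0 = poor, 1 = ok, 2 = good)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


def shortlist(candidates: dict[str, dict[str, int]], keep: int = 3) -> list[str]:
    """Rank candidate tools by weighted score and keep the top few."""
    ranked = sorted(
        candidates,
        key=lambda tool: weighted_score(candidates[tool]),
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical tools and scores, for illustration only.
candidates = {
    "tool_a": {"popularity": 2, "open_source": 2, "established": 2,
               "text_based": 2, "ease_of_use": 1, "documentation": 2},
    "tool_b": {"popularity": 1, "open_source": 2, "established": 1,
               "text_based": 2, "ease_of_use": 2, "documentation": 1},
    "tool_c": {"popularity": 2, "open_source": 0, "established": 2,
               "text_based": 0, "ease_of_use": 1, "documentation": 1},
}

print(shortlist(candidates, keep=2))
```

The scores are only a way to make your impressions explicit and comparable; the real assessment still comes from trying each shortlisted tool on a simpler problem.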

Hopefully there are several tools that you can shortlist. If there is only one tool that does what you need, then you have to decide whether to use it or whether it is worth your time to create your own. Apply the selection criteria to your specific needs. For example, if you want your work to be reproducible (and of course you do!), you should use a tool with an open-source licence. What does a licence have to do with reproducibility? There's an example from the book Building reproducible analytical pipelines with R describing the difficulty of getting an older version of MATLAB (a commercially available tool) for reproducing an analysis. If you can't re-run code using the same version of a tool, there's no guarantee that the results will be reproducible.

If a tool is hosted on a code repository, like GitHub or GitLab, it's easier to tell how popular it is and how actively it is maintained, for example from the number of stars, the date of the latest commit or release, and how issues are handled. If it is only hosted on a university, organisational, or personal server, this is harder to judge and the longevity of the tool is also in question. If the tool is associated with a publication, the number of citations can also be used as an indication of its popularity. If the source code is available, that is a plus, especially if the code is written in a language that you are familiar with, because it means that you can contribute to the tool.
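For tools hosted on GitHub, some of these signals can be pulled from the public REST API rather than read off the web page. The sketch below assumes the `GET /repos/{owner}/{repo}` endpoint and its `stargazers_count`, `open_issues_count`, `archived`, and `pushed_at` fields; the function names and the example values are my own and purely illustrative. Unauthenticated requests are rate limited, so this is only suited to checking a handful of repositories.

```python
from __future__ import annotations

import json
import urllib.request
from datetime import datetime, timezone


def fetch_repo_metadata(owner: str, repo: str) -> dict:
    """Fetch public repository metadata from the GitHub REST API (needs network access)."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    request = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarise_activity(metadata: dict, now: datetime) -> dict:
    """Reduce the API response to a few rough popularity and maintenance signals."""
    # pushed_at is a UTC timestamp such as "2024-01-15T09:30:00Z".
    last_push = datetime.strptime(
        metadata["pushed_at"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "stars": metadata["stargazers_count"],
        "open_issues": metadata["open_issues_count"],
        "archived": metadata["archived"],
        "days_since_last_push": (now - last_push).days,
    }


# Made-up metadata so the example runs without network access.
example = {
    "pushed_at": "2024-01-15T09:30:00Z",
    "stargazers_count": 1200,
    "open_issues_count": 35,
    "archived": False,
}
print(summarise_activity(example, now=datetime(2024, 3, 15, 9, 30, tzinfo=timezone.utc)))
```

For a real repository you would pass the result of `fetch_repo_metadata(owner, repo)` and `datetime.now(timezone.utc)` instead. Treat the numbers as hints only: a quiet repository may simply be finished, and a popular one may still be a poor fit for your needs.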

A tool should definitely support a command-line interface because that makes it much easier to integrate into a pipeline/workflow. It's not impossible to reproduce results using a graphical user interface, but it is error-prone and takes more time.
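To illustrate why a command-line interface matters, here is a minimal sketch of wrapping a tool as a scripted pipeline step that also records which version was run. The Python interpreter stands in for a real bioinformatics tool, and the `run_tool` and `run_step` helpers and the `--version` flag are assumptions; real tools differ in how they report their version (some print it to stderr, for instance).

```python
from __future__ import annotations

import subprocess
import sys


def run_tool(command: list[str]) -> str:
    """Run a command-line tool, raise on a non-zero exit code, and return its stdout."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


def run_step(tool: list[str], args: list[str], version_flag: str = "--version") -> dict:
    """Run one pipeline step and keep a record of what was run and with which version."""
    version = run_tool(tool + [version_flag]).strip()
    output = run_tool(tool + args)
    return {"command": tool + args, "version": version, "output": output}


# The Python interpreter is a placeholder for a real tool here.
record = run_step([sys.executable], ["-c", "print('hello from a pipeline step')"])
print(record["version"])
print(record["output"], end="")
```

Because the command and the version are captured as data, they can be written to a log alongside the results, which is much harder to do reliably when the same step is a sequence of clicks in a graphical interface.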

Many bioinformatics tools nowadays are distributed through Conda, the Python Package Index (PyPI, installed with pip), Docker, or all of them! This is great because it usually makes installing the tool easier. Sometimes a tool is provided as a statically linked binary, which means that you just have to download it and run it, with no dependencies to install. If you have to compile a tool, hopefully there's enough documentation to guide you through the process. If I have to compile a tool, I typically do so inside a Docker container.

Personally, I find documentation important. I like to know the details and what's going on (as much as I can comprehend) by reading documentation and not having to look at code or do a lot of testing. Bioinformatics documentation is a bit different because it needs to (and should) accommodate people with different backgrounds. I think at the very least a tool should have a simple tutorial or vignette to get someone started. I get that developers want to work on the tool/code and apply it to answer important questions, but if they want others to use their tool, there should be sufficient documentation to illustrate how to use it and how to interpret its results.

Conclusion

It takes time to learn how a tool works and to integrate it into your workflow. Picking the most appropriate tool right at the start of a project can save you a lot of time in the long run. Always plan ahead.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
