I have been using simple shell scripts to build my bioinformatics pipelines. I define variables that serve as parameter settings throughout the script, use some basic Unix tools to build my output file names, and check for the existence of files to see whether a step has already been run. Each step writes its results to a temporary file, which is moved to its final name only once the step has completed; the existence of the final file therefore tells me that the step finished. However, this approach is very basic.
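The temporary-file approach described above can be sketched roughly as follows. This is a minimal illustration and not a script from my actual pipelines: the `run_step` helper, the `sample` parameter, and the file names are all made up for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: run a step only if its final output does not exist yet.
# Usage: run_step <final_output> <command> [args...]
# The command's stdout goes to a temporary file, which is renamed to the
# final name only after the command has exited successfully.
run_step() {
    local final="$1"
    shift
    if [[ -e "$final" ]]; then
        echo "skipping: $final already exists" >&2
        return 0
    fi
    local tmp="${final}.tmp.$$"
    if "$@" > "$tmp"; then
        mv "$tmp" "$final"
    else
        # Remove the partial result so it is never mistaken for a finished step.
        rm -f "$tmp"
        echo "failed: $*" >&2
        return 1
    fi
}

# Hypothetical parameter settings, for illustration only.
workdir=$(mktemp -d)
sample="sample1"

# A toy step: upper-case a sequence read from stdin.
run_step "$workdir/$sample.upper.txt" tr 'a-z' 'A-Z' <<< "acgt"
```

Running the script a second time against the same directory would skip the step, because the final file already exists; a step that fails leaves no final file behind.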
I recently learned about Bpipe, a tool for running and managing bioinformatics pipelines. Its documentation shows how a pipeline implemented as a shell script can be transformed into Bpipe stages. The first thing you may notice is that the steps of the pipeline are now separated into different modules. The rest of the tutorial covers the other features, such as defining variables, having Bpipe handle the input and output names, and having each step nicely logged.
Getting started
To get started, download and extract the tarball; the bpipe executable is in the bin directory:
    wget http://download.bpipe.org/versions/bpipe-0.9.8.7.tar.gz
    tar -xzf bpipe-0.9.8.7.tar.gz
    ls bpipe-0.9.8.7/bin/bpipe
    bpipe-0.9.8.7/bin/bpipe
Following the Hello World example, we create a file named hello.pipe:
    cat hello.pipe
    hello = {
        exec "echo Hello"
    }
    world = {
        exec "echo World"
    }
    Bpipe.run { hello + world }
To run the pipeline, which is simply two echo steps, we type:
    bpipe hello.pipe
    ====================================================================================================
    | Starting Pipeline at 2015-05-28 09:31 |
    ====================================================================================================
    =========================================== Stage hello ============================================
    Hello
    =========================================== Stage world ============================================
    World
    ======================================== Pipeline Succeeded ========================================
    09:31:25 MSG: Finished at Thu May 28 09:31:25 WST 2015
    /Users/dtang/bin/bpipe: line 724: 6809 Terminated: 15 ( tail -f $LOGFILE | sed -l "$TAIL_PATTERN" )
Bpipe also writes a commandlog.txt file, which records the commands run by each stage:
    cat commandlog.txt
    ####################################################################################################
    # Starting pipeline at Thu May 28 09:31:25 WST 2015
    # Input files: null
    # Output Log: .bpipe/logs/6795.log
    # Stage hello
    echo Hello
    # Stage world
    echo World
    # ################ Finished at Thu May 28 09:31:25 WST 2015 Duration = 0.681 seconds #################
The next example shows how you can use the Bpipe $input and $output variables. Firstly we need to create a test file:
    #create test file
    echo Bpipe > test.txt

    #what's inside the second pipeline file
    cat hello2.pipe
    hello = {
        exec "echo Hello | cat - $input > $output"
    }
    world = {
        exec "echo World | cat $input - > $output"
    }
    run { hello + world }

    #running the pipeline
    bpipe run hello2.pipe test.txt
    ====================================================================================================
    | Starting Pipeline at 2015-05-28 09:38 |
    ====================================================================================================
    =========================================== Stage hello ============================================
    =========================================== Stage world ============================================
    ======================================== Pipeline Succeeded ========================================
    09:38:10 MSG: Finished at Thu May 28 09:38:10 WST 2015
    09:38:10 MSG: Output is test.txt.hello.world
    /Users/dtang/bin/bpipe: line 724: 7187 Terminated: 15 ( tail -f $LOGFILE | sed -l "$TAIL_PATTERN" )

    #two output files are created
    cat test.txt.hello
    Hello
    Bpipe
    cat test.txt.hello.world
    Hello
    Bpipe
    World
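For comparison, here is roughly what the two stages amount to when written by hand in plain shell, using the file names that Bpipe produced in the run above. This is only a sketch to show what `$input` and `$output` stand for in each stage; the scratch directory and the `hello_out` and `world_out` variable names are my own additions.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
input="$workdir/test.txt"
echo Bpipe > "$input"

# Stage hello: "-" tells cat to read stdin at that position, so "Hello"
# is placed before the contents of the input file.
hello_out="$input.hello"
echo Hello | cat - "$input" > "$hello_out"

# Stage world: here "-" comes second, so "World" is appended after the
# output of the previous stage.
world_out="$hello_out.world"
echo World | cat "$hello_out" - > "$world_out"

cat "$world_out"
```

The point of the Bpipe version is that the intermediate and final names never have to be spelled out; each stage's output automatically becomes the next stage's input.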
The output files have the stage names appended to the file names. We can add a file type to the $input and $output variables (here $input.txt and $output.txt) to create files with the correct extension:
    cat hello3.pipe
    hello = {
        exec "echo Hello | cat - $input.txt > $output.txt"
    }
    world = {
        exec "echo World | cat $input.txt - > $output.txt"
    }
    run { hello + world }

    bpipe run hello3.pipe test.txt
    ====================================================================================================
    | Starting Pipeline at 2015-05-28 09:47 |
    ====================================================================================================
    =========================================== Stage hello ============================================
    =========================================== Stage world ============================================
    ======================================== Pipeline Succeeded ========================================
    09:47:17 MSG: Finished at Thu May 28 09:47:17 WST 2015
    09:47:17 MSG: Output is test.hello.world.txt
    /Users/dtang/bin/bpipe: line 724: 7736 Terminated: 15 ( tail -f $LOGFILE | sed -l "$TAIL_PATTERN" )

    cat test.hello.txt
    Hello
    Bpipe
    cat test.hello.world.txt
    Hello
    Bpipe
    World
Bpipe also supports annotations, which can be attached to each stage:
    cat hello4.pipe
    @Filter("hello")
    hello = {
        exec "echo Hello | cat - $input > $output"
    }
    @Filter("world")
    world = {
        exec "echo World | cat $input - > $output"
    }
    run { hello + world }

    bpipe run hello4.pipe test.txt
    ====================================================================================================
    | Starting Pipeline at 2015-05-28 09:53 |
    ====================================================================================================
    =========================================== Stage hello ============================================
    =========================================== Stage world ============================================
    ======================================== Pipeline Succeeded ========================================
    09:53:57 MSG: Finished at Thu May 28 09:53:57 WST 2015
    09:53:57 MSG: Output is test.hello.world.txt
    /Users/dtang/bin/bpipe: line 720: 8121 Terminated: 15 ( tail -f $LOGFILE | sed -l "$TAIL_PATTERN" )

    cat test.hello.txt
    Hello
    Bpipe
    cat test.hello.world.txt
    Hello
    Bpipe
    World
Notice that no extensions were given to the $input and $output variables, yet the output file names have the correct extension. The @Filter annotation marks a stage as a filter, meaning that the file is modified but its type remains the same; since we started with a txt file, we end up with a txt file.
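In a plain shell script I would have to build these names myself. The sketch below mimics the naming seen in the output above (the stage name is inserted before the extension) using shell parameter expansion. It is not how Bpipe works internally, the `stage_output` function is a hypothetical helper, and it assumes the input name has an extension.

```shell
# Hypothetical helper: insert a stage name before the file extension,
# e.g. test.txt + hello -> test.hello.txt
# Assumes the input file name contains at least one dot.
stage_output() {
    local input="$1" stage="$2"
    local ext="${input##*.}"    # text after the last dot
    local stem="${input%.*}"    # text before the last dot
    printf '%s.%s.%s\n' "$stem" "$stage" "$ext"
}

hello_out=$(stage_output test.txt hello)
world_out=$(stage_output "$hello_out" world)

echo "$hello_out"
echo "$world_out"
```

Chaining the helper reproduces the two names from the run above, test.hello.txt and test.hello.world.txt, but every stage has to call it explicitly, which is the bookkeeping that the annotation removes.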
Summary
Bpipe has many other features, which are listed on the main page of the documentation. I'm not very familiar with pipeline tools (since I've just been using shell scripts) but Bpipe seems like a good starting point. I tweeted about Bpipe a few days ago and Harold mentioned Snakemake to me; I'll definitely check that out too!
@davetang31 seems useful if testing many diff types of pipelines. snakemake seems a bit more practical if analyzing many samples similarly
— Harold Pimentel (@hjpimentel) May 26, 2015
This work is licensed under a Creative Commons
Attribution 4.0 International License.
I am not sure I understand the utility of this program. If you want command logging in your bash script, just include `set -x` at the start and redirect stderr to a log file. If you want a log of the script itself, just include `cat $0 > scriptlog.txt`. If you want to check if a step was already executed, then use `if [ ! -f output.file ]; then ; fi`. If you want to log program versions, then run ` --version`. There are native bash solutions to most of the problems that bpipe seems to try to fix.
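The native bash approach suggested in this comment might look something like the sketch below. The log file names and the scratch directory are hypothetical, the two echo commands stand in for real pipeline steps, and bash itself stands in for a real tool whose version you would want to record.

```shell
#!/usr/bin/env bash
# Hypothetical log locations, chosen for illustration.
logdir=$(mktemp -d)
cmdlog="$logdir/commandlog.txt"
versionlog="$logdir/versions.txt"

# Keep a copy of the script itself, if $0 points at a real file.
if [[ -f "$0" ]]; then
    cp "$0" "$logdir/scriptlog.txt"
fi

# Record the version of a tool used by the pipeline
# (bash is only a stand-in for a real program here).
bash --version | head -n 1 > "$versionlog"

# Trace every command of the pipeline body into the command log.
# set -x writes the trace to stderr, which the subshell redirects to the log.
(
    set -x
    echo Hello
    echo World
) 2> "$cmdlog"

cat "$cmdlog"
```

This covers the command log and version log, but the output naming and the per-stage structure still have to be managed by hand.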