Updated 2018 May 29th to include example using a config file
As promised two years ago, here's a short blog post on Snakemake. I have been using Bpipe to manage my workflows/pipelines but Snakemake has been mentioned to me on more than one occasion; in particular:
@davetang31 seems useful if testing many diff types of pipelines. snakemake seems a bit more practical if analyzing many samples similarly
— Harold Pimentel (@hjpimentel) May 26, 2015
The main motive behind this post was that a colleague shared his work with me and he had used Snakemake to manage his analyses. To get started, here's the description of Snakemake from the documentation page:
Snakemake is a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style. Snakemake workflows are essentially Python scripts extended by declarative code to define rules. Rules describe how to create output files from input files.
The rules follow that of GNU Make, which I wrote a blog post on. If you don't want to click on another post, here's the Makefile I created for that post:
all: data norm.png data: raw.tsv norm.tsv raw.tsv: get_data.pl get_data.pl > raw.tsv norm.tsv: norm.R raw.tsv R CMD BATCH norm.R norm.png: norm.R norm.tsv R CMD BATCH plot.R clean: rm -rf *.tsv *.Rout *.png
Installing Snakemake
From the installation page, it states that:
Snakemake is available on PyPi as well as through Bioconda and also from source code. You can use one of the following ways for installing Snakemake.
PyPi is the Python Package Index; Conda is a cross-platform package and environment manager that installs, runs, and updates packages and their dependencies.; and Bioconda is a channel for the conda package manager specializing in bioinformatics software. I opted for the pip3 option on my MacBook Air (I used Homebrew to install Python3):
# install Homebrew and Python3 # ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" # brew update # brew upgrade # brew install python3 # pip3 install --upgrade pip setuptools wheel pip3 install snakemake ... Successfully built snakemake wrapt Installing collected packages: wrapt, idna, certifi, urllib3, chardet, requests, snakemake Successfully installed certifi-2017.4.17 chardet-3.0.4 idna-2.5 requests-2.18.1 snakemake-3.13.3 urllib3-1.21.1 wrapt-1.10.10 which snakemake /usr/local/bin/snakemake
Getting started
Let's get started with a simple example. From the Snakefile, we see a Snakemake rule, which is made up of:
- A rule (sort in the example below)
- An input
- An output
- A shell
The input and output directives are lists of files that are expected to be used and created, respectively, by the rule. The shell specifies the command to execute and we can refer to elements of the rule via braces notation.
# files in the root directory of this example ls -1 Snakefile test.txt # contents of test.txt cat test.txt 5 3 7 2 # Snakemake file cat Snakefile rule sort: input: "test.txt" output: "test.sorted.txt" shell: "sort -n {input} > {output}" # run Snakemake snakemake Provided cores: 1 Rules claiming more threads will be scaled down. Job counts: count jobs 1 sort 1 rule sort: input: test.txt output: test.sorted.txt jobid: 0 Finished job 0. 1 of 1 steps (100%) done # files in directory after running Snakemake ls -1 Snakefile test.sorted.txt test.txt # contents of test.sorted.txt cat test.sorted.txt 2 3 5 7
What if I wanted to sort multiple text files? The expand() function is a helper function provided by Snakemake for collecting input files; there is a replicate argument that can be used to generate file names with multiple patterns (that is not shown in the example below).
The input of rule all is the final output of the pipeline.
ls -1 Snakefile test1.txt test2.txt test3.txt # following the example in the Snakemake presentation cat Snakefile DATASETS = ["test1", "test2", "test3"] rule all: input: expand("{dataset}.sorted.txt", dataset=DATASETS) rule sort: input: "{dataset}.txt" output: "{dataset}.sorted.txt" shell: "sort -n {input} > {output}" snakemake Provided cores: 1 Rules claiming more threads will be scaled down. Job counts: count jobs 1 all 3 sort 4 rule sort: input: test1.txt output: test1.sorted.txt jobid: 1 wildcards: dataset=test1 Finished job 1. 1 of 4 steps (25%) done rule sort: input: test3.txt output: test3.sorted.txt jobid: 2 wildcards: dataset=test3 Finished job 2. 2 of 4 steps (50%) done rule sort: input: test2.txt output: test2.sorted.txt jobid: 3 wildcards: dataset=test2 Finished job 3. 3 of 4 steps (75%) done localrule all: input: test1.sorted.txt, test2.sorted.txt, test3.sorted.txt jobid: 0 Finished job 0. 4 of 4 steps (100%) done ls -1 Snakefile test1.sorted.txt test1.txt test2.sorted.txt test2.txt test3.sorted.txt test3.txt
Using a config file
Snakemake can use a config file, written in either JSON or YAML, for customising your pipeline; you will need to install the PyYAML package to use YAML. Below is a minimal example adapted from Kamil Slowikowski's tutorial on using Snakemake.
# generate some files touch genome.fa mkdir -p fastq/{Sample1,Sample2} touch fastq/Sample1/Sample1.R1.fastq.gz fastq/Sample1/Sample1.R2.fastq.gz touch fastq/Sample2/Sample2.R1.fastq.gz fastq/Sample2/Sample2.R2.fastq.gz # create config.yaml samples: Sample1: fastq/Sample1 Sample2: fastq/Sample2 # Snakemake file configfile: "config.yaml" rule all: input: expand("{sample}.txt", sample=config["samples"]), rule quantify_genes: input: genome = "genome.fa" output: "{sample}.txt" shell: "echo {input.genome} -s {wildcards.sample} -r1 fastq/{wildcards.sample}/{wildcards.sample}.R1.fastq.gz -r2 fastq/{wildcards.sample}/{wildcards.sample}.R2.fastq.gz > {output}" # run snakemake Provided cores: 1 Rules claiming more threads will be scaled down. Job counts: count jobs 1 all 2 quantify_genes 3 rule quantify_genes: input: genome.fa output: Sample2.txt jobid: 1 wildcards: sample=Sample2 Finished job 1. 1 of 3 steps (33%) done rule quantify_genes: input: genome.fa output: Sample1.txt jobid: 2 wildcards: sample=Sample1 Finished job 2. 2 of 3 steps (67%) done localrule all: input: Sample1.txt, Sample2.txt jobid: 0 Finished job 0. 3 of 3 steps (100%) done # check output cat Sample[12].txt genome.fa -s Sample1 -r1 fastq/Sample1/Sample1.R1.fastq.gz -r2 fastq/Sample1/Sample1.R2.fastq.gz genome.fa -s Sample2 -r1 fastq/Sample2/Sample2.R1.fastq.gz -r2 fastq/Sample2/Sample2.R2.fastq.gz
Tips
Taken from the Snakemake tutorial presentation (see Links).
# execute the workflow with target D1.sorted.txt snakemake D1.sorted.txt # execute the workflow without target: first rule defines target snakemake # dry-run snakemake -n # dry-run, print shell commands snakemake -n -p # dry-run, print execution reason for each job snakemake -n -r
- In addition, you can include Python code in the Snakefile including importing other Python packages
- Use the params directive to define arbitrary parameters
- Snakemake allows files to be marked as temporary, such that once a process that requires the file has completed, it is deleted.
Summary
I've only gone through the Snakemake presentation and there are a lot of features provided by the software. The most important to me are parallelisation and logging capabilities but I guess these should be core features of any workflow building tool. For more information on other tools, check out this Biostars post.
Links
- Snakemake presentation by the author of the tool.
- Snakemake tutorial

This work is licensed under a Creative Commons
Attribution 4.0 International License.
I followed your first example after installing snakemake and got following error.
I don’t see any difference in the command that I typed
Command must be given as string after the shell keyword. (Snakefile33, line 7)
I just copied and pasted the code and it worked for me. Did you use four and eight spaces to separate the respective lines?
Ah… Thank you for pointing that out. some how my vim setting did not correctly reflect this.
I appreciate this conversation. I just bumped the same wall. Problem solved by using eight spaces to separate lines rather than tab.
I’m glad this old post is still useful.