Learning about Snakemake

Updated 29 May 2018 to include an example using a config file

As promised two years ago, here's a short blog post on Snakemake. I have been using Bpipe to manage my workflows/pipelines, but Snakemake has been mentioned to me on more than one occasion.

The main motivation behind this post was that a colleague shared his work with me, and he had used Snakemake to manage his analyses. To get started, here's the description of Snakemake from the documentation page:

Snakemake is a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style. Snakemake workflows are essentially Python scripts extended by declarative code to define rules. Rules describe how to create output files from input files.

The rule syntax follows that of GNU Make, which I have written a blog post about. If you don't want to click through to another post, here's the Makefile I created for it:

all: data norm.png
data: raw.tsv norm.tsv

raw.tsv: get_data.pl
	get_data.pl > raw.tsv

norm.tsv: norm.R raw.tsv
	R CMD BATCH norm.R

norm.png: plot.R norm.tsv
	R CMD BATCH plot.R

clean:
	rm -rf *.tsv *.Rout *.png
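
For comparison, here is a minimal sketch of how the same pipeline could be expressed as a Snakefile; it assumes the same get_data.pl, norm.R, and plot.R scripts, and the rule names are my own rather than a definitive translation:

# requesting norm.png builds raw.tsv and norm.tsv along the way
rule all:
    input:
        "norm.png"

rule get_data:
    input:
        "get_data.pl"
    output:
        "raw.tsv"
    shell:
        "get_data.pl > {output}"

rule normalise:
    input:
        "norm.R", "raw.tsv"
    output:
        "norm.tsv"
    shell:
        "R CMD BATCH norm.R"

rule plot:
    input:
        "plot.R", "norm.tsv"
    output:
        "norm.png"
    shell:
        "R CMD BATCH plot.R"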

Installing Snakemake

From the installation page, it states that:

Snakemake is available on PyPi as well as through Bioconda and also from source code. You can use one of the following ways for installing Snakemake.

PyPI is the Python Package Index; Conda is a cross-platform package and environment manager that installs, runs, and updates packages and their dependencies; and Bioconda is a channel for the conda package manager specialising in bioinformatics software. I opted for the pip3 option on my MacBook Air (I used Homebrew to install Python3):

# install Homebrew and Python3
# ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
# brew update
# brew upgrade
# brew install python3
# pip3 install --upgrade pip setuptools wheel

pip3 install snakemake
...
Successfully built snakemake wrapt
Installing collected packages: wrapt, idna, certifi, urllib3, chardet, requests, snakemake
Successfully installed certifi-2017.4.17 chardet-3.0.4 idna-2.5 requests-2.18.1 snakemake-3.13.3 urllib3-1.21.1 wrapt-1.10.10

which snakemake
/usr/local/bin/snakemake
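
Alternatively, Snakemake can be installed with conda from the Bioconda channel; I didn't take this route, so treat it as an untested alternative:

# requires conda; pulls Snakemake from the bioconda and conda-forge channels
conda install -c bioconda -c conda-forge snakemake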

Getting started

Let's get started with a simple example. The Snakefile below contains a single Snakemake rule, which is made up of:

  1. A rule name (sort in the example below)
  2. An input
  3. An output
  4. A shell

The input and output directives list the files that are expected to be used and created, respectively, by the rule. The shell directive specifies the command to execute, and we can refer to elements of the rule using curly-brace notation, e.g. {input} and {output}.

# files in the root directory of this example
ls -1
Snakefile
test.txt

# contents of test.txt
cat test.txt 
5
3
7
2

# Snakemake file
cat Snakefile 
rule sort:
    input:
        "test.txt"
    output:
        "test.sorted.txt"
    shell:
        "sort -n {input} > {output}"

# run Snakemake
snakemake
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	sort
	1

rule sort:
    input: test.txt
    output: test.sorted.txt
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

# files in directory after running Snakemake
ls -1
Snakefile
test.sorted.txt
test.txt

# contents of test.sorted.txt
cat test.sorted.txt 
2
3
5
7

What if I wanted to sort multiple text files? The expand() function is a helper provided by Snakemake for generating lists of files from a pattern; each keyword argument fills in a wildcard, and supplying more than one keyword argument (for example a replicate wildcard in addition to dataset) generates every combination. That isn't used in the example below, but a short sketch follows.
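
As a quick illustration of expanding over more than one wildcard, here is a hypothetical snippet; the replicate values are made up for this example:

DATASETS = ["test1", "test2", "test3"]
REPLICATES = ["rep1", "rep2"]

# expands to test1.rep1.sorted.txt, test1.rep2.sorted.txt, test2.rep1.sorted.txt, and so on
rule all:
    input:
        expand("{dataset}.{replicate}.sorted.txt", dataset=DATASETS, replicate=REPLICATES)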

The input of rule all is the final output of the pipeline.

ls -1
Snakefile
test1.txt
test2.txt
test3.txt

# following the example in the Snakemake presentation
cat Snakefile 
DATASETS = ["test1", "test2", "test3"]

rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)

rule sort:
    input:
        "{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort -n {input} > {output}"

snakemake
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	3	sort
	4

rule sort:
    input: test1.txt
    output: test1.sorted.txt
    jobid: 1
    wildcards: dataset=test1

Finished job 1.
1 of 4 steps (25%) done

rule sort:
    input: test3.txt
    output: test3.sorted.txt
    jobid: 2
    wildcards: dataset=test3

Finished job 2.
2 of 4 steps (50%) done

rule sort:
    input: test2.txt
    output: test2.sorted.txt
    jobid: 3
    wildcards: dataset=test2

Finished job 3.
3 of 4 steps (75%) done

localrule all:
    input: test1.sorted.txt, test2.sorted.txt, test3.sorted.txt
    jobid: 0

Finished job 0.
4 of 4 steps (100%) done

ls -1
Snakefile
test1.sorted.txt
test1.txt
test2.sorted.txt
test2.txt
test3.sorted.txt
test3.txt

Using a config file

Snakemake can use a config file, written in either JSON or YAML, for customising your pipeline; you will need to install the PyYAML package to use YAML. Below is a minimal example adapted from Kamil Slowikowski's tutorial on using Snakemake.

# generate some files
touch genome.fa
mkdir -p fastq/{Sample1,Sample2}
touch fastq/Sample1/Sample1.R1.fastq.gz fastq/Sample1/Sample1.R2.fastq.gz
touch fastq/Sample2/Sample2.R1.fastq.gz fastq/Sample2/Sample2.R2.fastq.gz

# create config.yaml
samples:
    Sample1: fastq/Sample1
    Sample2: fastq/Sample2

# Snakemake file
configfile: "config.yaml"

rule all:
    input:
        expand("{sample}.txt", sample=config["samples"]),

rule quantify_genes:
    input:
        genome = "genome.fa"
    output:
        "{sample}.txt"
    shell:
        "echo {input.genome} -s {wildcards.sample} -r1 fastq/{wildcards.sample}/{wildcards.sample}.R1.fastq.gz -r2 fastq/{wildcards.sample}/{wildcards.sample}.R2.fastq.gz > {output}"

# run
snakemake
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        2       quantify_genes
        3

rule quantify_genes:
    input: genome.fa
    output: Sample2.txt
    jobid: 1
    wildcards: sample=Sample2

Finished job 1.
1 of 3 steps (33%) done

rule quantify_genes:
    input: genome.fa
    output: Sample1.txt
    jobid: 2
    wildcards: sample=Sample1

Finished job 2.
2 of 3 steps (67%) done

localrule all:
    input: Sample1.txt, Sample2.txt
    jobid: 0

Finished job 0.
3 of 3 steps (100%) done

# check output
cat Sample[12].txt
genome.fa -s Sample1 -r1 fastq/Sample1/Sample1.R1.fastq.gz -r2 fastq/Sample1/Sample1.R2.fastq.gz
genome.fa -s Sample2 -r1 fastq/Sample2/Sample2.R1.fastq.gz -r2 fastq/Sample2/Sample2.R2.fastq.gz
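
Note that the Snakefile above only uses the keys of config["samples"] (the sample names); the directory paths stored as the values aren't used, since the FASTQ locations are hard-coded in the shell command. If you wanted to look the directories up from the config file instead, one option (a sketch, not what the tutorial does) is an input function:

rule quantify_genes:
    input:
        genome = "genome.fa",
        # look up the FASTQ directory for this sample in config.yaml
        r1 = lambda wildcards: config["samples"][wildcards.sample] + "/" + wildcards.sample + ".R1.fastq.gz",
        r2 = lambda wildcards: config["samples"][wildcards.sample] + "/" + wildcards.sample + ".R2.fastq.gz"
    output:
        "{sample}.txt"
    shell:
        "echo {input.genome} -s {wildcards.sample} -r1 {input.r1} -r2 {input.r2} > {output}"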

Tips

Taken from the Snakemake tutorial presentation (see Links).

# execute the workflow with target D1.sorted.txt
snakemake D1.sorted.txt

# execute the workflow without target: first rule defines target
snakemake

# dry-run
snakemake -n

# dry-run, print shell commands
snakemake -n -p

# dry-run, print execution reason for each job
snakemake -n -r

  • In addition, you can include Python code in the Snakefile, including imports of other Python packages
  • Use the params directive to define arbitrary parameters for a rule
  • Snakemake allows files to be marked as temporary with temp(), so that they are deleted once every job that requires them has finished (see the sketch below)
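
Here's a small sketch combining these three tips; the rule is a variation on the sort rule from earlier, and the parameter name sort_opts is made up for this example:

# plain Python code is allowed in a Snakefile, e.g. importing other packages
import os  # only here to show that imports work

rule sort:
    input:
        "{dataset}.txt"
    output:
        # temp() marks the file for deletion once no remaining job needs it
        temp("{dataset}.sorted.txt")
    params:
        # arbitrary parameters, referenced as {params.sort_opts} in the shell command
        sort_opts = "-n"
    shell:
        "sort {params.sort_opts} {input} > {output}"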

Summary

I've only gone through the Snakemake presentation, and the software provides many more features than I have covered here. The most important to me are parallelisation and logging, but I guess these should be core features of any workflow tool. For more information on other tools, check out this Biostars post.
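
On those two points: independent jobs can be run in parallel by giving Snakemake more cores, and a rule can declare a log file with the log directive. A minimal sketch (the logs/ path is made up for this example):

# run up to four independent jobs in parallel
snakemake --cores 4

# a rule that writes its stderr to a log file
rule sort:
    input:
        "{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    log:
        "logs/{dataset}.sort.log"
    shell:
        "sort -n {input} > {output} 2> {log}"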

Links

This work is licensed under a Creative Commons Attribution 4.0 International License.

Comments
  1. I followed your first example after installing snakemake and got the following error; I don't see any difference in the command that I typed:

     Command must be given as string after the shell keyword. (Snakefile33, line 7)

     Reply: I just copied and pasted the code and it worked for me. Did you use four and eight spaces to indent the respective lines?

  2. I appreciate this conversation. I just bumped into the same wall. Problem solved by using eight spaces to indent the lines rather than a tab.
