Learning about Makefiles

I started learning about Makefiles around the time I was learning about C. I still don't know much about Makefiles or C, but I'm revisiting Makefiles because I'm interested in using them for building reproducible pipelines. Initially I thought Makefiles were just text files used to help compile software. However, as I learned from Twitter, they can also be used for organising workflows (among many other things). Karl Broman believes that GNU make (which processes Makefiles) is the most important tool for reproducible research. The main reasons are that GNU make lets you automate processes and that the dependencies of each step are defined explicitly within the Makefile.

A rule in a Makefile is formatted as follows (the whitespace at the start of the second line must be a tab):

target: dependencies
        system command(s)

The target is something to be made, and the dependencies are the files that are required for making the target. The system commands are the actual commands that are executed. With this construct, the dependencies and commands used to produce the target are clear. For example, your target could be an image file that corresponds to figure 1 of your manuscript. The Makefile will list the files that are required to produce figure 1 as well as the command or script that was used. This is very useful, since we can easily find out how figure 1 was made.
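
As a minimal sketch of the figure 1 case, a rule could look like the one below. The file names (make_figure1.R, measurements.tsv) are made up for illustration, and I'm assuming the script reads the data file and writes figure1.png.

```makefile
# Hypothetical rule: figure1.png is rebuilt whenever the script or the data changes.
# The recipe line must start with a tab, not spaces.
figure1.png: make_figure1.R measurements.tsv
	Rscript make_figure1.R
```

Anyone reading this rule can see which inputs the figure depends on and which command produced it.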

An example

I found two tutorials that were helpful in understanding how GNU make and Makefiles work. The first is by Karl Broman and the second is by Zachary Jones. I will follow the example in Zachary's post but with some scripts that I wrote. Here's my simple Makefile:

all: data figure1.png
data: raw.tsv norm.tsv

raw.tsv: get_data.pl
	get_data.pl > raw.tsv

norm.tsv: norm.R raw.tsv
	R CMD BATCH norm.R

figure1.png: plot.R norm.tsv
	R CMD BATCH plot.R

clean:
	rm -rf *.tsv *.Rout *.png

The first line defines a dummy target called all, which has two dependencies: another dummy target called data and the figure1.png image file, which is the file that plot.R writes. The data target itself has two dependencies: raw.tsv and norm.tsv. Having an all target is good practice; because it is the first target in the Makefile, it is the one that is built by default when make is called without arguments.

The raw.tsv target depends on the get_data.pl script and is produced by running the code get_data.pl > raw.tsv. To create raw.tsv, we run:

make raw.tsv

The norm.tsv target depends on the norm.R script and the raw.tsv data. If we try to create norm.tsv before raw.tsv exists, make builds raw.tsv first automatically. Hence if we type make all (without running anything else beforehand), all the dependencies are created as well. The last target, clean, is a common convention; it simply runs a command that deletes all the output files.

Here are the scripts:

get_data.pl:

#!/usr/bin/env perl

use strict;
use warnings;

#set seed
srand(31);

my $c = 0;
for (1 .. 100){
   my $i = int(rand(1000));
   ++$c;
   print "$i";
   if ($c % 10 == 0){
      print "\n";
   } else {
      print "\t";
   }
}

exit(0);

norm.R:

x <- as.matrix(read.table('raw.tsv', header=F))

z_norm <- function(x){
  s <- sd(x)
  m <- mean(x)
  round((x - m)/s, 4)
}

z <- t(apply(x, 1, z_norm))
write.table(z,
            'norm.tsv',
            quote = F,
            sep = "\t",
            row.names = F,
            col.names = F)

plot.R:

x <- as.matrix(read.table('norm.tsv', header=F))

# open the PNG device, draw the heatmap, then close the device to write the file
png('figure1.png')
image(x)
dev.off()

Now if we typed make or make all:

make
get_data.pl > raw.tsv
R CMD BATCH norm.R
R CMD BATCH plot.R

ls -1
Makefile
figure1.png
get_data.pl
norm.R
norm.Rout
norm.tsv
plot.R
plot.Rout
raw.tsv

Figure 1: the image saved as figure1.png.

One of the cool things is that if we change one of the scripts, GNU make compares the timestamp of the script with that of the output file. If the output file is older than the script, GNU make will recreate the output file. All the other targets that depend on this file will also be recreated. Targets that are already up to date are left alone: for example, if we change plot.R, only the figure is regenerated, and raw.tsv and norm.tsv are not rebuilt.

Makefiles also force you to modularise your workflow, instead of having one big script. This is important in reproducible research because it makes your workflow easier to understand and to manage.

Additional notes

Makefiles work in a pull-based fashion: the workflow is invoked by asking for a specific target, and make then executes all the tasks required to produce that target.

To see what make will run without actually running the commands, use make --dry-run.

It is good practice to tell Make that all and clean refer to targets in the Makefile and not to any files by including:

.PHONY: all clean

In Makefiles, the two automatic variables $@ and $^ refer to the target and to all the prerequisites of a rule, respectively ($< refers to the first prerequisite only).

target ($@): dependencies ($^)
        system command(s)
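
As a rough sketch, two of the rules from the example Makefile could be rewritten with automatic variables as shown below. The echo line is my own addition, there only to show what $^ expands to.

```makefile
# $@ = the target, $< = the first prerequisite, $^ = all prerequisites

raw.tsv: get_data.pl
	$< > $@                       # expands to: get_data.pl > raw.tsv

norm.tsv: norm.R raw.tsv
	@echo "building $@ from: $^"  # prints: building norm.tsv from: norm.R raw.tsv
	R CMD BATCH $<                # expands to: R CMD BATCH norm.R
```

The benefit is that renaming a file only requires changing the first line of the rule, since the recipe picks up the new names automatically.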

Use % to define a pattern rule.
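
For instance, here is a hypothetical pattern rule (not part of my Makefile above) that would compress any .tsv file. The % matches the same stem in the target and in the prerequisite.

```makefile
# Hypothetical pattern rule: build <name>.tsv.gz from <name>.tsv
# $< is the matching .tsv file and $@ is the requested .tsv.gz file.
%.tsv.gz: %.tsv
	gzip -c $< > $@
```

With this rule in place, make norm.tsv.gz would run gzip -c norm.tsv > norm.tsv.gz, after first rebuilding norm.tsv if it is out of date.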

See the Automatic Variables section of the GNU make manual for the full list of these variables.

This work is licensed under a Creative Commons Attribution 4.0 International License.

7 comments
  1. Hi,

    thank you for that nice and concise post on using Makefiles beyond compiling source code.
    Makefiles are very powerful tools for defining workflows of various kinds.
    For example, if you are building a BLAST database to search against, why not create a Makefile that handles the downloading and creation? That way you can easily rebuild the DB without needing to enter all the commands, e.g., if your reference sequences have been updated.
    Importantly, once a little bit of effort has been put into this, it is trivial to put these things online (e.g., code repository) and share them with others, in particular since the order of executions and required parameters is specified making it much easier to “take it from there”.

    However, Makefiles are not necessarily easy to read (I am not delving into .PHONY, Make-variables vs. bash-variables, implicit pattern rules, etc.), especially when more complex tasks are involved, and alternatives have been developed that might be easier to write and read.
    While I am not particularly experienced in the alternatives, I think it is important to highlight them too so that people interested in implementing more structure into their computations (think several individual but still related scripts lying around in a project folder) can potentially find something that suits their needs. Some noteworthy alternatives are IMO snakemake (https://bitbucket.org/johanneskoester/snakemake/wiki/Home) or bpipe (https://github.com/ssadedin/bpipe). There might be many others and posting them here could help others to find their preferred choice.

    Best,

    Cedric

    1. Hi Cedric,

      thank you for the detailed comment! The versatility of Makefiles is indeed limited by the imagination!

      And thank you for telling the other side of the story too. I guess when I learn a bit more about Makefiles, I will appreciate your comment much more. My last post was on Bpipe and indeed it was quite powerful and easy to use.

      Cheers,

      Dave

  2. One thing about Makefiles that I learned the hard way is that make runs its commands in a separate shell. By default it uses the ‘sh’ shell, which causes incompatibilities with bash arithmetic, if .. fi statements, increments like k+=1, etc. Including ‘SHELL=/bin/bash’ at the beginning of a Makefile helps commands work as expected.

    The $@ and $< automatic variables (https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html) are among the first things to learn, as is using the $$ prefix for regular (shell or awk) variables inside recipes, e.g., awk '{print $$0}'

  3. Hey,
    I really like your blog in general. Very helpful although a bit too advanced for my skills.

    Check out Snakemake, which was developed with the same idea as GNU Make but with a more human-readable workflow syntax. One of its main advantages is that it is very close to Python and can interface with R, shell and many others. All Python libraries can be used (e.g. Biopython).
    https://bitbucket.org/johanneskoester/snakemake/wiki/Home

    I’ve started making my own pipelines for RNA-Seq based on neat existing ones such as this one.
    https://github.com/leipzig/snakemake-example/blob/master/Snakefile

    Hope you can make a post about Snakemake later on!

    1. Hello,

      thanks for the comment and for letting me know about Snakemake; you’re the second person to tell me about it.

      I will definitely write a post on Snakemake 🙂

      Cheers,

      Dave
