Using GNU parallel

Updated 2020 February 26th to include section "Strip directory and extensions".

I wrote this short guide on using GNU parallel for my biologist buddies who would like to harness the power of parallelisation. There are a lot of really useful guides out there but here I try to give simplistic examples. Let's get started by downloading the program and compiling it:

wget http://ftp.jaist.ac.jp/pub/GNU/parallel/parallel-latest.tar.bz2
tar -xjf parallel-latest.tar.bz2
cd parallel-20131022/
#parallel will be compiled in the current directory
./configure --prefix=`pwd` && make && make install

Now adapting some of the examples from the GNU parallel tutorial:

#a basic example of running 3 echos in parallel
parallel echo ::: A B C
A
B
C

In the above example, I ran the echo command, which simply displays a line of text and used parallel to input three lines to the echo command: A, B and C. The ::: is the separator between the echo command and the input source. Now instead of using echo, I can compress files in parallel:

#touch creates 3 empty files
touch a.txt b.txt c.txt
#3 files are made
ls
a.txt  b.txt  c.txt

#we can gzip them in parallel
parallel gzip ::: a.txt b.txt c.txt
ls
a.txt.gz  b.txt.gz  c.txt.gz

#instead of using the ::: notation
#we can specify input as a file
#gunzip the files to obtain the uncompressed files
gunzip *.gz
#store the names of *.txt files inside a file called my_files
ls -1 *.txt > my_files
#check the contents of my_files
cat my_files
a.txt
b.txt
c.txt

#gzip in parallel using my_files
parallel -a my_files gzip
ls
a.txt.gz  b.txt.gz  c.txt.gz  my_files

#and lastly parallel can also use STDIN
ls *.gz | parallel echo
a.txt.gz
b.txt.gz
c.txt.gz

You can also save a bunch of commands into a file and redirect input to GNU Parallel.

# file with commands
cat to_run.txt 
echo 1
echo 2
echo 3
echo 4
echo 5
echo 6
echo 7
echo 8

# your output order may differ
parallel < to_run.txt 
1
2
3
4
5
6
7
8

You can also have multiple inputs; just use another set of ::: separators.

parallel echo ::: A B C ::: A B C
A A
A B
A C
B A
B B
B C
C A
C B
C C

A word on quotes

When constructing a pipeline, especially using awk, the quotations may conflict. For example, let's say I wanted to rearrange the columns of an output using awk:

cat my_file.tsv | awk '{OFS="\t"; print $3, $2, $1}' > my_file_rearranged.tsv

I want to do this with all tsv files, so I would do something like this:

parallel "cat {} | awk '{OFS="\t"; print $3, $2, $1}' > {.}_rearranged.tsv" ::: *.tsv
awk: {OFS=t; print , none, bell-style}
awk:               ^ syntax error
awk: {OFS=t; print , none, bell-style}
awk:               ^ syntax error
awk: {OFS=t; print , none, bell-style}
awk:               ^ syntax error

What we can do is store the awk line in a variable and run parallel as so:

my_awk='{OFS="\t"; print $3, $2, $1}'
parallel "cat {} | awk '$my_awk' > {.}_rearranged.tsv" ::: *.tsv

GNU parallel with blat

I have shown examples of displaying lines of text and compressing files in parallel. We can apply this to some commonly used bioinformtic tools, such as blat. For example, we can run a blat job per each assembled chromsome of hg19:

parallel blat chr{}.fa human/test.fa test_{}.psl ::: {1..22} X Y M

In the above example, I have a list of numbers from 1 to 22, and the characters X, Y and M as the input to parallel (they corresponds to the assembled chromosomes of hg19). In my directory, I have each of the chromosome as fasta files (for example chr1.fa), however my input to parallel is 1. We can refer to the input to parallel by using the curly braces, {}. To illustrate this:

parallel echo chr{}.fa ::: {1..22} X Y M | head -5
chr1.fa
chr2.fa
chr3.fa
chr4.fa
chr5.fa

To see the actual blat command being run without running it, use --dryrun:

parallel --dryrun blat chr{}.fa human/test.fa test_{}.psl ::: {1..22} X Y M | head -5
blat chr1.fa human/test.fa test_1.psl
blat chr2.fa human/test.fa test_2.psl
blat chr3.fa human/test.fa test_3.psl
blat chr4.fa human/test.fa test_4.psl
blat chr5.fa human/test.fa test_5.psl

Another possibility is to split the input file, test.fa, into different chunks and blat to all chromosomes at once:

# create a fasta file with 10 entries
perl -le 'for (1..10){ print ">$_\nACTG" }' > test.fa

# split test.fa with one entry per file
# however, be careful when splitting FASTA entries like this
# because sometimes sequences inside a file may be contain newlines
# here the assumption is that one FASTA entry takes up 2 lines
# without any newlines in between entries
split --lines=2 test.fa test_split.

# split into 10 files
ls -1 test_split.*
test_split.aa
test_split.ab
test_split.ac
test_split.ad
test_split.ae
test_split.af
test_split.ag
test_split.ah
test_split.ai
test_split.aj

# save list of files in a new file
ls -1 test_split.a* > my_files

# run blat in parallel on files saved in my_files
parallel -a my_files --dryrun blat hg19.fa {} {.}.psl
blat hg19.fa test_split.aa test_split.psl
blat hg19.fa test_split.ab test_split.psl
blat hg19.fa test_split.ac test_split.psl
blat hg19.fa test_split.ad test_split.psl
blat hg19.fa test_split.ae test_split.psl
blat hg19.fa test_split.af test_split.psl
blat hg19.fa test_split.ag test_split.psl
blat hg19.fa test_split.ah test_split.psl
blat hg19.fa test_split.ai test_split.psl
blat hg19.fa test_split.aj test_split.psl

GNU parallel with SAMtools

I usually have to work with several BAM files and being able to process them in parallel would save a lot of time. For this example I have 40 BAM files and I want to get some simple statistics on all the files:

parallel 'samtools flagstat {} > {.}.stat' ::: *.bam
#to see what is run without running the command
#use --dryrun to check if the command looks right
parallel --dryrun 'samtools flagstat {} > {.}.stat' ::: *.bam | head -1
samtools flagstat c10_ACT.bam > c10_ACT.stat
#use --verbose to print the command that will be run
parallel --verbose 'samtools flagstat {} > {.}.stat' ::: *.bam

In the above example, I ran the samtools flagstat command, which calculates some simple statistics of a particular BAM file. I referred to each individual BAM file using the curly braces and I redirected (using >) the output of samtools flagstat to {.}.stat. The {.} removes the extension of the file, which in this case the ".bam" is removed (this is akin to running basename blah.bam .bam, which becomes blah). So the BAM file called c10_ACT.bam will have the results of flagstat stored in a file called c10_ACT.stat.

Other useful features

Please refer to the GNU parallel tutorial for more examples. Here I list a subset of the features from the tutorial.

#using 4 cores instead of all possible cores
parallel -j4 'samtools flagstat {} > {.}.stat' ::: *.bam

#keeping the progress by using --progress
#10 sleep jobs
parallel --progress sleep ::: 1 3 2 2 1 3 3 2 10

Computers / CPU cores / Max jobs to run
1:local / 12 / 9

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/9/100%/1.1s

#keeping a log
parallel --joblog /tmp/log 'samtools flagstat {} > {.}.stat' ::: *.bam
#checking the log
head /tmp/log
Seq     Host    Starttime       Runtime Send    Receive Exitval Signal  Command
2       :       1384752054.071  1.449   0       0       0       0       samtools flagstat c11_ATC.bam > c11_ATC.stat
10      :       1384752054.079  2.351   0       0       0       0       samtools flagstat c19_ATC.bam > c19_ATC.stat
4       :       1384752054.073  2.498   0       0       0       0       samtools flagstat c13_CTT.bam > c13_CTT.stat
11      :       1384752054.08   4.096   0       0       0       0       samtools flagstat c1_GAT.bam > c1_GAT.stat
1       :       1384752054.07   4.374   0       0       0       0       samtools flagstat c10_ACT.bam > c10_ACT.stat
8       :       1384752054.077  4.37    0       0       0       0       samtools flagstat c17_GAT.bam > c17_GAT.stat
3       :       1384752054.072  4.504   0       0       0       0       samtools flagstat c12_AGC.bam > c12_AGC.stat
7       :       1384752054.076  4.51    0       0       0       0       samtools flagstat c16_ATG.bam > c16_ATG.stat
15      :       1384752056.574  2.441   0       0       0       0       samtools flagstat c4_AGC.bam > c4_AGC.stat

#processing chunks
#create file with 1 million lines
perl -le 'for (1 .. 1_000_000){ print $_ }' > 1000000.txt
cat 1000000.txt | wc -l
1000000
cat 1000000.txt | parallel --pipe wc -l
158729
142857
142857
142857
142857
142858
126985
#change block size
cat 1000000.txt | parallel --pipe --block 2M wc -l
301586
285714
285715
126985

#-k
#Keep sequence of output same as the order of input.
parallel -j4 sleep {}\; echo {} ::: 2 1 4 3
1
2
3
4
parallel -j4 -k sleep {}\; echo {} ::: 2 1 4 3
2
1
4
3

# removing two levels of extensions
# see https://www.gnu.org/software/parallel/man.html and the section #example__removing_two_file_extensions_when_processing_files_and
# the --er parameter in the man page just renames this syntax {.} with your preference
touch test.tar.gz
ls test.tar.gz | parallel echo {.} | parallel echo {.}
test

Strip directory and extensions

Syntax as follows.

  • {} - full name
  • {.} - one less file extension
  • {/} - remove directory path
  • {//} - directory path
  • {/..} - remove directory path and two levels of file extension
echo dir1/dir2/file.txt.gz | parallel --plus 'echo -e {}\\n{.}\\n{/}\\n{//}\\n{/..}'
dir1/dir2/file.txt.gz
dir1/dir2/file.txt
file.txt.gz
dir1/dir2
file

Formatting your output

There is another way to remove two levels of extensions from a file, for example, removing .tar.gz from foo.tar.gz. This method also allows you to format your output as you like. However, the example shown in the official documentation may not be clear to those who are not familiar with Perl and regular expressions. Here's the example from the documentation:

parallel echo '{= s:\.[^.]+$::;s:\.[^.]+$::; =}' ::: foo.tar.gz
foo

In Perl, we can substitute text using the syntax s/a/b/; this will substitute the first a that is found and replaces it with a b. If we want to replace a with nothing, the expression would be s/a//. Conveniently (or confusingly), you can use any delimiter you wish instead of the forward slash (/); s:a:b: is the same expression. If you look closely, there are two substitutions above:

s:\.[^.]+$::
s:\.[^.]+$::

That translates to substitute text that has a period (.) in front, followed by one or more characters that are not periods ([^.]) until you reach the end ($). The first substitution will remove the .gz and the second substitution will remove the .tar:

#do only one substitution
parallel echo '{= s:\.[^.]+$::; =}' ::: foo.tar.gz
foo.tar

#do three substitutions
parallel echo '{= s:\.[^.]+$::; s:\.[^.]+$::; s/foo/bar/; =}' ::: foo.tar.gz
bar

Conclusions

GNU parallel makes it incredibly easy to parallel jobs; in the past I have relied on simple forking in Perl. I have written this simple guide to illustrate its ease of use and how it can be applicable to the daily tasks of anyone working in the genomics field (as long as you have access to a computer with multiple cores). You can simply adapt the simple examples above to your own bioinformatic tool. Mapping sequences to genomes can be parallelised since the genome is conveniently split up by chromosomes. If your input can be split up into independent chunks, such as independent reads from a sequencer, the job can be parallelised. Here's a more comprehensive tutorial on using GNU parallel with other bioinformtic tools written by the developer of GNU parallel. For more advanced usage examples of GNU parallel check out the documentation.

And here's the citation of GNU parallel (kudos to him for this awesome tool!):

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, The USENIX Magazine, February 2011:42-47.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
5 comments Add yours
  1. Look at dymnamic replacement strings today:

    parallel –plus echo {%.tar.gz} ::: foo.tar.gz

    You can even make your own. Remove this many file extensions:

    parallel –rpl ‘{ext=(.*)} for my $i (1..$$1){s:\.[^/.]+$::;}’ echo {ext=3} {ext=2} {ext=1} ::: foo.beta.tar.gz

  2. I have a bash script as follows. Is it possible to run that script in gnu parallel?

    #!/bin/bash

    for i in *.bam; do
    gatk –java-options “-Xmx40g” HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I $i -O $i.vcf.gz –dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list; done

    1. I recommend using scattering on an interval list that has been divided; you can use IntervalListTools either from Picard or GATK4.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.