Updated 2020 February 26th to include section "Strip directory and extensions".
I wrote this short guide on using GNU parallel for my biologist buddies who would like to harness the power of parallelisation. There are a lot of really useful guides out there but here I try to give simplistic examples. Let's get started by downloading the program and compiling it:
wget http://ftp.jaist.ac.jp/pub/GNU/parallel/parallel-latest.tar.bz2 tar -xjf parallel-latest.tar.bz2 cd parallel-20131022/ #parallel will be compiled in the current directory ./configure --prefix=`pwd` && make && make install
Now adapting some of the examples from the GNU parallel tutorial:
#a basic example of running 3 echos in parallel parallel echo ::: A B C A B C
In the above example, I ran the echo command, which simply displays a line of text and used parallel to input three lines to the echo command: A, B and C. The ::: is the separator between the echo command and the input source. Now instead of using echo, I can compress files in parallel:
#touch creates 3 empty files touch a.txt b.txt c.txt #3 files are made ls a.txt b.txt c.txt #we can gzip them in parallel parallel gzip ::: a.txt b.txt c.txt ls a.txt.gz b.txt.gz c.txt.gz #instead of using the ::: notation #we can specify input as a file #gunzip the files to obtain the uncompressed files gunzip *.gz #store the names of *.txt files inside a file called my_files ls -1 *.txt > my_files #check the contents of my_files cat my_files a.txt b.txt c.txt #gzip in parallel using my_files parallel -a my_files gzip ls a.txt.gz b.txt.gz c.txt.gz my_files #and lastly parallel can also use STDIN ls *.gz | parallel echo a.txt.gz b.txt.gz c.txt.gz
You can also save a bunch of commands into a file and redirect input to GNU Parallel.
# file with commands cat to_run.txt echo 1 echo 2 echo 3 echo 4 echo 5 echo 6 echo 7 echo 8 # your output order may differ parallel < to_run.txt 1 2 3 4 5 6 7 8
You can also have multiple inputs; just use another set of ::: separators.
parallel echo ::: A B C ::: A B C A A A B A C B A B B B C C A C B C C
A word on quotes
When constructing a pipeline, especially using awk, the quotations may conflict. For example, let's say I wanted to rearrange the columns of an output using awk:
cat my_file.tsv | awk '{OFS="\t"; print $3, $2, $1}' > my_file_rearranged.tsv
I want to do this with all tsv files, so I would do something like this:
parallel "cat {} | awk '{OFS="\t"; print $3, $2, $1}' > {.}_rearranged.tsv" ::: *.tsv awk: {OFS=t; print , none, bell-style} awk: ^ syntax error awk: {OFS=t; print , none, bell-style} awk: ^ syntax error awk: {OFS=t; print , none, bell-style} awk: ^ syntax error
What we can do is store the awk line in a variable and run parallel as so:
my_awk='{OFS="\t"; print $3, $2, $1}' parallel "cat {} | awk '$my_awk' > {.}_rearranged.tsv" ::: *.tsv
GNU parallel with blat
I have shown examples of displaying lines of text and compressing files in parallel. We can apply this to some commonly used bioinformtic tools, such as blat. For example, we can run a blat job per each assembled chromsome of hg19:
parallel blat chr{}.fa human/test.fa test_{}.psl ::: {1..22} X Y M
In the above example, I have a list of numbers from 1 to 22, and the characters X, Y and M as the input to parallel (they corresponds to the assembled chromosomes of hg19). In my directory, I have each of the chromosome as fasta files (for example chr1.fa), however my input to parallel is 1. We can refer to the input to parallel by using the curly braces, {}. To illustrate this:
parallel echo chr{}.fa ::: {1..22} X Y M | head -5 chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa
To see the actual blat command being run without running it, use --dryrun:
parallel --dryrun blat chr{}.fa human/test.fa test_{}.psl ::: {1..22} X Y M | head -5 blat chr1.fa human/test.fa test_1.psl blat chr2.fa human/test.fa test_2.psl blat chr3.fa human/test.fa test_3.psl blat chr4.fa human/test.fa test_4.psl blat chr5.fa human/test.fa test_5.psl
Another possibility is to split the input file, test.fa, into different chunks and blat to all chromosomes at once:
# create a fasta file with 10 entries perl -le 'for (1..10){ print ">$_\nACTG" }' > test.fa # split test.fa with one entry per file # however, be careful when splitting FASTA entries like this # because sometimes sequences inside a file may be contain newlines # here the assumption is that one FASTA entry takes up 2 lines # without any newlines in between entries split --lines=2 test.fa test_split. # split into 10 files ls -1 test_split.* test_split.aa test_split.ab test_split.ac test_split.ad test_split.ae test_split.af test_split.ag test_split.ah test_split.ai test_split.aj # save list of files in a new file ls -1 test_split.a* > my_files # run blat in parallel on files saved in my_files parallel -a my_files --dryrun blat hg19.fa {} {.}.psl blat hg19.fa test_split.aa test_split.psl blat hg19.fa test_split.ab test_split.psl blat hg19.fa test_split.ac test_split.psl blat hg19.fa test_split.ad test_split.psl blat hg19.fa test_split.ae test_split.psl blat hg19.fa test_split.af test_split.psl blat hg19.fa test_split.ag test_split.psl blat hg19.fa test_split.ah test_split.psl blat hg19.fa test_split.ai test_split.psl blat hg19.fa test_split.aj test_split.psl
GNU parallel with SAMtools
I usually have to work with several BAM files and being able to process them in parallel would save a lot of time. For this example I have 40 BAM files and I want to get some simple statistics on all the files:
parallel 'samtools flagstat {} > {.}.stat' ::: *.bam #to see what is run without running the command #use --dryrun to check if the command looks right parallel --dryrun 'samtools flagstat {} > {.}.stat' ::: *.bam | head -1 samtools flagstat c10_ACT.bam > c10_ACT.stat #use --verbose to print the command that will be run parallel --verbose 'samtools flagstat {} > {.}.stat' ::: *.bam
In the above example, I ran the samtools flagstat command, which calculates some simple statistics of a particular BAM file. I referred to each individual BAM file using the curly braces and I redirected (using >) the output of samtools flagstat to {.}.stat. The {.} removes the extension of the file, which in this case the ".bam" is removed (this is akin to running basename blah.bam .bam, which becomes blah). So the BAM file called c10_ACT.bam will have the results of flagstat stored in a file called c10_ACT.stat.
Other useful features
Please refer to the GNU parallel tutorial for more examples. Here I list a subset of the features from the tutorial.
#using 4 cores instead of all possible cores parallel -j4 'samtools flagstat {} > {.}.stat' ::: *.bam #keeping the progress by using --progress #10 sleep jobs parallel --progress sleep ::: 1 3 2 2 1 3 3 2 10 Computers / CPU cores / Max jobs to run 1:local / 12 / 9 Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete local:0/9/100%/1.1s #keeping a log parallel --joblog /tmp/log 'samtools flagstat {} > {.}.stat' ::: *.bam #checking the log head /tmp/log Seq Host Starttime Runtime Send Receive Exitval Signal Command 2 : 1384752054.071 1.449 0 0 0 0 samtools flagstat c11_ATC.bam > c11_ATC.stat 10 : 1384752054.079 2.351 0 0 0 0 samtools flagstat c19_ATC.bam > c19_ATC.stat 4 : 1384752054.073 2.498 0 0 0 0 samtools flagstat c13_CTT.bam > c13_CTT.stat 11 : 1384752054.08 4.096 0 0 0 0 samtools flagstat c1_GAT.bam > c1_GAT.stat 1 : 1384752054.07 4.374 0 0 0 0 samtools flagstat c10_ACT.bam > c10_ACT.stat 8 : 1384752054.077 4.37 0 0 0 0 samtools flagstat c17_GAT.bam > c17_GAT.stat 3 : 1384752054.072 4.504 0 0 0 0 samtools flagstat c12_AGC.bam > c12_AGC.stat 7 : 1384752054.076 4.51 0 0 0 0 samtools flagstat c16_ATG.bam > c16_ATG.stat 15 : 1384752056.574 2.441 0 0 0 0 samtools flagstat c4_AGC.bam > c4_AGC.stat #processing chunks #create file with 1 million lines perl -le 'for (1 .. 1_000_000){ print $_ }' > 1000000.txt cat 1000000.txt | wc -l 1000000 cat 1000000.txt | parallel --pipe wc -l 158729 142857 142857 142857 142857 142858 126985 #change block size cat 1000000.txt | parallel --pipe --block 2M wc -l 301586 285714 285715 126985 #-k #Keep sequence of output same as the order of input. parallel -j4 sleep {}\; echo {} ::: 2 1 4 3 1 2 3 4 parallel -j4 -k sleep {}\; echo {} ::: 2 1 4 3 2 1 4 3 # removing two levels of extensions # see https://www.gnu.org/software/parallel/man.html and the section #example__removing_two_file_extensions_when_processing_files_and # the --er parameter in the man page just renames this syntax {.} with your preference touch test.tar.gz ls test.tar.gz | parallel echo {.} | parallel echo {.} test
Strip directory and extensions
Syntax as follows.
- {} - full name
- {.} - one less file extension
- {/} - remove directory path
- {//} - directory path
- {/..} - remove directory path and two levels of file extension
echo dir1/dir2/file.txt.gz | parallel --plus 'echo -e {}\\n{.}\\n{/}\\n{//}\\n{/..}' dir1/dir2/file.txt.gz dir1/dir2/file.txt file.txt.gz dir1/dir2 file
Formatting your output
There is another way to remove two levels of extensions from a file, for example, removing .tar.gz from foo.tar.gz. This method also allows you to format your output as you like. However, the example shown in the official documentation may not be clear to those who are not familiar with Perl and regular expressions. Here's the example from the documentation:
parallel echo '{= s:\.[^.]+$::;s:\.[^.]+$::; =}' ::: foo.tar.gz foo
In Perl, we can substitute text using the syntax s/a/b/; this will substitute the first a that is found and replaces it with a b. If we want to replace a with nothing, the expression would be s/a//. Conveniently (or confusingly), you can use any delimiter you wish instead of the forward slash (/); s:a:b: is the same expression. If you look closely, there are two substitutions above:
s:\.[^.]+$:: s:\.[^.]+$::
That translates to substitute text that has a period (.) in front, followed by one or more characters that are not periods ([^.]) until you reach the end ($). The first substitution will remove the .gz and the second substitution will remove the .tar:
#do only one substitution parallel echo '{= s:\.[^.]+$::; =}' ::: foo.tar.gz foo.tar #do three substitutions parallel echo '{= s:\.[^.]+$::; s:\.[^.]+$::; s/foo/bar/; =}' ::: foo.tar.gz bar
Conclusions
GNU parallel makes it incredibly easy to parallel jobs; in the past I have relied on simple forking in Perl. I have written this simple guide to illustrate its ease of use and how it can be applicable to the daily tasks of anyone working in the genomics field (as long as you have access to a computer with multiple cores). You can simply adapt the simple examples above to your own bioinformatic tool. Mapping sequences to genomes can be parallelised since the genome is conveniently split up by chromosomes. If your input can be split up into independent chunks, such as independent reads from a sequencer, the job can be parallelised. Here's a more comprehensive tutorial on using GNU parallel with other bioinformtic tools written by the developer of GNU parallel. For more advanced usage examples of GNU parallel check out the documentation.
And here's the citation of GNU parallel (kudos to him for this awesome tool!):
O. Tange (2011): GNU Parallel - The Command-Line Power Tool, The USENIX Magazine, February 2011:42-47.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Look at dymnamic replacement strings today:
parallel –plus echo {%.tar.gz} ::: foo.tar.gz
You can even make your own. Remove this many file extensions:
parallel –rpl ‘{ext=(.*)} for my $i (1..$$1){s:\.[^/.]+$::;}’ echo {ext=3} {ext=2} {ext=1} ::: foo.beta.tar.gz
I have a bash script as follows. Is it possible to run that script in gnu parallel?
#!/bin/bash
for i in *.bam; do
gatk –java-options “-Xmx40g” HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I $i -O $i.vcf.gz –dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list; done
I recommend using scattering on an interval list that has been divided; you can use IntervalListTools either from Picard or GATK4.