This is a continuation of my post on getting started with Bpipe. In this post I followed the paired end alignment example from the documentation, with a few adjustments. This pipeline generates a random reference sequence, generates paired end reads from the reference, and aligns these reads back to the reference using BWA.
Here's the pipeline file (pipeline.groovy):
SEED=31 REF_SIZE=1000000 REF="test.fa" READ_NO=10000 READ_LEN=100 INNER_DIST=400 random_ref = { // Usage: generate_random_seq.pl <bp> <seed> exec "generate_random_seq.pl $REF_SIZE $SEED > $REF" } index_ref = { exec "bwa index $REF" } random_read = { // Usage: random_paired_end.pl <infile.fa> <read length> <number of pairs> <inner mate distance> <seed> exec "random_paired_end.pl $REF $READ_LEN $READ_NO $INNER_DIST $SEED" } bwa_align = { from(glob("*.fq")) { transform("sai","sai","bam"){ multi "bwa aln $REF $input1.fq > $output1", "bwa aln $REF $input2.fq > $output2" exec """ bwa sampe $REF $output1 $output2 $input1.fq $input2.fq | samtools view -bS - | samtools sort - $output.bam.prefix """ } } } Bpipe.run { random_ref + index_ref + random_read + bwa_align }
Here's the generate_random_seq.pl script:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/env perl | |
use strict; | |
use warnings; | |
my $usage = "Usage: $0 <bp> <seed>\n"; | |
my $num = shift or die $usage; | |
my $seed = shift or die $usage; | |
#set seed for reproducibility | |
srand($seed); | |
my $random_seq = random_seq($num); | |
print ">$num | $seed\n$random_seq\n"; | |
exit(0); | |
sub random_seq { | |
my ($num) = @_; | |
my @nuc = qw/ A C G T /; | |
my $seq = ''; | |
for (1 .. $num){ | |
my $rand_ind = int(rand(scalar(@nuc))); | |
$seq .= $nuc[$rand_ind]; | |
} | |
return($seq); | |
} | |
exit(0); |
Here's the random_paired_end.pl script:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/env perl | |
# Simple script that takes an input fasta sequence | |
# and generates paired end reads | |
use strict; | |
use warnings; | |
my $usage = "Usage: $0 <infile.fa> <read length> <number of pairs> <inner mate distance> <seed>\n"; | |
my $fasta = shift or die $usage; | |
my $len = shift or die $usage; | |
my $num = shift or die $usage; | |
my $inner_mate = shift or die $usage; | |
my $seed = shift or die $usage; | |
srand($seed); | |
my $seq = ''; | |
open(IN,'<',$fasta) || die "Could not open $fasta: $!\n"; | |
while(<IN>){ | |
chomp; | |
next if /^>/; | |
$seq .= $_; | |
} | |
close(IN); | |
my $seq_len = length($seq); | |
my $limit = $seq_len - $len -$len - $inner_mate + 1; | |
if ($len > $seq_len){ | |
die "Your read length is longer than the input sequence\n"; | |
} | |
# on Illumina 1.8+ ! is the worst quality | |
# and J is the best | |
my $fake_qual = 'J' x $len; | |
my $name = 'l' . $len . '_' . 'n' . $num . '_' . 'd' . $inner_mate . '_' . $seed; | |
my $first_out = $name . '_1.fq'; | |
my $second_out = $name . '_2.fq'; | |
open(R1,'>',$first_out) || die "Could not open $first_out for writing: $!\n"; | |
open(R2,'>',$second_out) || die "Could not open $second_out for writing: $!\n"; | |
for (1 .. $num){ | |
my $first_start = int(rand($limit)); | |
if ($first_start > $limit){ | |
while( $first_start > $limit ){ | |
$first_start = int(rand($limit)); | |
} | |
} | |
my $first_read = substr($seq,$first_start,$len); | |
my $first_pos = $first_start + 1; | |
print R1 "\@$_:$first_pos\n$first_read\n+\n$fake_qual\n"; | |
my $second_start = $first_start + $inner_mate; | |
my $second_read = substr($seq,$second_start,$len); | |
$second_read = reverse($second_read); | |
$second_read =~ tr/ACGT/TGCA/; | |
my $second_pos = $second_start + 1; | |
print R2 "\@$_:$first_pos\n$second_read\n+\n$fake_qual\n"; | |
} | |
close(R1); | |
close(R2); | |
exit(0); |
Finally here's the Makefile:
all: bpipe run pipeline.groovy clean: rm -rf commandlog.txt .bpipe *.fq *.fa* *.sai *.bam
To run this pipeline, download the two Perl scripts, create the Makefile, and the pipeline.groovy as above, and put them all into the same directory and type:
make #type make clean to delete all the generated files [snipped] ======================================== Pipeline Succeeded ======================================== 11:02:01 MSG: Finished at Thu Jun 04 11:02:01 WST 2015 11:02:01 MSG: Outputs are: l100_n10000_d400_31_1.sai l100_n10000_d400_31_2.sai l100_n10000_d400_31_1.bam #the final alignment file is the BAM file #and should have the same md5 output #md5sum on Linux md5 l100_n10000_d400_31_1.bam MD5 (l100_n10000_d400_31_1.bam) = a77b2b726e64210a6b10baa9fe377220
Summary
The pipeline is fairly straight-forward; the purpose of this post was to create a working example of a Bpipe pipeline for paired end alignment as I could not follow the paired end example on the Bpipe documentation. There will be more posts in the near future on using Bpipe to build different types of bioinformatics pipelines.

This work is licensed under a Creative Commons
Attribution 4.0 International License.