Paired end alignment using Bpipe

This is a continuation of my post on getting started with Bpipe. In this post I followed the paired end alignment example from the documentation, with a few adjustments. This pipeline generates a random reference sequence, generates paired end reads from the reference, and aligns these reads back to the reference using BWA.

Here’s the pipeline file (pipeline.groovy):

SEED=31
REF_SIZE=1000000
REF="test.fa"
READ_NO=10000
READ_LEN=100
INNER_DIST=400

random_ref = {
   // Usage: generate_random_seq.pl <bp> <seed>
   exec "generate_random_seq.pl $REF_SIZE $SEED > $REF"
}

index_ref = {
   exec "bwa index $REF"
}

random_read = {
   // Usage: random_paired_end.pl <infile.fa> <read length> <number of pairs> <inner mate distance> <seed>
   exec "random_paired_end.pl $REF $READ_LEN $READ_NO $INNER_DIST $SEED"
}

bwa_align = {
   from(glob("*.fq")) {
      transform("sai","sai","bam"){
         multi "bwa aln $REF $input1.fq > $output1",
               "bwa aln $REF $input2.fq > $output2"

         exec """
            bwa sampe $REF $output1 $output2 $input1.fq $input2.fq |
            samtools view -bS - |
            samtools sort - $output.bam.prefix
         """
      }
   }
}

Bpipe.run { random_ref + index_ref + random_read + bwa_align }

Here’s the generate_random_seq.pl script:


#!/usr/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <bp> <seed>\n";
my $num = shift or die $usage;
my $seed = shift or die $usage;
#set seed for reproducibility
srand($seed);
my $random_seq = random_seq($num);
print ">$num | $seed\n$random_seq\n";
exit(0);
sub random_seq {
my ($num) = @_;
my @nuc = qw/ A C G T /;
my $seq = '';
for (1 .. $num){
my $rand_ind = int(rand(scalar(@nuc)));
$seq .= $nuc[$rand_ind];
}
return($seq);
}
exit(0);

Here’s the random_paired_end.pl script:


#!/usr/bin/env perl
# Simple script that takes an input fasta sequence
# and generates paired end reads
use strict;
use warnings;
my $usage = "Usage: $0 <infile.fa> <read length> <number of pairs> <inner mate distance> <seed>\n";
my $fasta = shift or die $usage;
my $len = shift or die $usage;
my $num = shift or die $usage;
my $inner_mate = shift or die $usage;
my $seed = shift or die $usage;
srand($seed);
my $seq = '';
open(IN,'<',$fasta) || die "Could not open $fasta: $!\n";
while(<IN>){
chomp;
next if /^>/;
$seq .= $_;
}
close(IN);
my $seq_len = length($seq);
my $limit = $seq_len$len$len$inner_mate + 1;
if ($len > $seq_len){
die "Your read length is longer than the input sequence\n";
}
# on Illumina 1.8+ ! is the worst quality
# and J is the best
my $fake_qual = 'J' x $len;
my $name = 'l' . $len . '_' . 'n' . $num . '_' . 'd' . $inner_mate . '_' . $seed;
my $first_out = $name . '_1.fq';
my $second_out = $name . '_2.fq';
open(R1,'>',$first_out) || die "Could not open $first_out for writing: $!\n";
open(R2,'>',$second_out) || die "Could not open $second_out for writing: $!\n";
for (1 .. $num){
my $first_start = int(rand($limit));
if ($first_start > $limit){
while( $first_start > $limit ){
$first_start = int(rand($limit));
}
}
my $first_read = substr($seq,$first_start,$len);
my $first_pos = $first_start + 1;
print R1 "\@$_:$first_pos\n$first_read\n+\n$fake_qual\n";
my $second_start = $first_start + $inner_mate;
my $second_read = substr($seq,$second_start,$len);
$second_read = reverse($second_read);
$second_read =~ tr/ACGT/TGCA/;
my $second_pos = $second_start + 1;
print R2 "\@$_:$first_pos\n$second_read\n+\n$fake_qual\n";
}
close(R1);
close(R2);
exit(0);

Finally here’s the Makefile:

all:
        bpipe run pipeline.groovy

clean:
        rm -rf commandlog.txt .bpipe *.fq *.fa* *.sai *.bam

To run this pipeline, download the two Perl scripts, create the Makefile, and the pipeline.groovy as above, and put them all into the same directory and type:

make
#type make clean to delete all the generated files

[snipped]
======================================== Pipeline Succeeded ========================================
11:02:01 MSG:  Finished at Thu Jun 04 11:02:01 WST 2015
11:02:01 MSG:  Outputs are: 
                l100_n10000_d400_31_1.sai
                l100_n10000_d400_31_2.sai
                l100_n10000_d400_31_1.bam

#the final alignment file is the BAM file
#and should have the same md5 output
#md5sum on Linux
md5 l100_n10000_d400_31_1.bam
MD5 (l100_n10000_d400_31_1.bam) = a77b2b726e64210a6b10baa9fe377220

Summary

The pipeline is fairly straight-forward; the purpose of this post was to create a working example of a Bpipe pipeline for paired end alignment as I could not follow the paired end example on the Bpipe documentation. There will be more posts in the near future on using Bpipe to build different types of bioinformatics pipelines.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.