Phylogenetic profiling

On my wiki I have a short summary of phylogenetic profiling. The program MrBayes is used for Bayesian inference for phylogeny and can be used for inferring relationships using binary type data such as phylogenetic profiles.

The input to MrBayes is a NEXUS file and here is the example I will use:

#NEXUS
begin data;
  dimensions ntax=10 nchar=10;
  format datatype=restriction interleave=no gap=-;
  matrix
  homer 1100000000
  marge 1100000000
  sheldon 0011000000
  amy 0011000000
  lily 0000110000
  marshall 0000110000
  wilma 0000001100
  fred 0000001100
  scooby 0000000011
  shaggy 0000000011
  ;
end;

The input to MrBayes (for more information on the commands refer to the MrBayes manual).


set autoclose=yes nowarn=yes
execute test.nex
mcmc Nchains=8 Ngen=1000000 Temp=0.100000
sump burnin=5000
sumt burnin=5000
quit

MrBayes outputs consensus trees (*.con) in the Newick format that can be visualised using TreeView.

This may be useful for converting large datasets into binary, although losing information, to observe whether any relationships exist.

Saving disk space with Perl

Disk space is cheaper these days but here's one way of using less disk space by working directly with gzipped files. Here's a very straight forward example of Perl code that opens a gzipped file and outputs a gzipped file.


#!/usr/bin/perl

use strict;
use warnings;

my $infile = 'test.txt.gz';
#the three argument open is the preferred way
open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";

my $outfile = 'test.out.gz';
open(OUT,'|-',"gzip >$outfile") || die "Could not gzip $outfile: $!\n";

while(<IN>){
   chomp;
   print OUT "$_\n";
}
close(IN);
close(OUT);

exit(0);

__END__

And here's some other code that just counts the number of lines in a file, when gzipped and when it is not gzipped.

#!/usr/bin/perl

use strict;
use warnings;

my $infile = 'big_whoop';
open(IN,'<',$infile) || die "Could not open $infile: $!\n";

#my $infile = 'big_whoop.gz';
#open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";

my $line_count = '0';
while(<IN>){
   chomp;
   ++$line_count;
}
close(IN);

print "$line_count\n";

exit(0);

__END__

Using time, the difference between working with a gzipped and not gzipped file when counting ~7.3 million lines:

Gzipped result:

7320248

real 0m3.725s
user 0m5.313s
sys 0m0.844s

Not gzipped:

7320248

real 0m2.481s
user 0m2.151s
sys 0m0.328s

From SRA to FASTQ for SOLiD data

Before I forget how to do this (again), and waste time scouring the web, I'll make it a post. The last time I checked, the SRA will not be continued, however, there still exists a large repository of data that one may wish to (re)analyse. And sometimes the sequencing data of interest was performed on the SOLiD platform meaning that the sequence information will be in colour space. Not all bioinformatic tools support colour space, so here's how to convert the sequence information into the more widely used FASTQ format.

So firstly download the SRA files from your closest mirror, which in my case is DDBJ. Then download the SRA Toolkit. Then run:


abi-dump -A test SRR012345.lite.sra

Heng Li has written a Perl script that converts SOLiD csfasta files into fastq files. It is included in the scripts directory of the MAQ software package. I have included it here for convenience:


#!/usr/bin/perl

# Author: lh3
# Note: Ideally, this script should be written in C. It is a bit slow at present.
# Also note that this script is different from the one contained in MAQ.

use strict;
use warnings;
use diagnostics;
use Getopt::Std;

my %opts;
my $version = '0.1.4';
my $usage = qq{
Usage: solid2fastq.pl <in.title> <out.prefix>

Note: <in.title> is the string showed in the `# Title:' line of a
      ".csfasta" read file. Then <in.title>F3.csfasta is read sequence
      file and <in.title>F3_QV.qual is the quality file. If
      <in.title>R3.csfasta is present, this script assumes reads are
      paired; otherwise reads will be regarded as single-end.

      The read name will be <out.prefix>:panel_x_y/[12] with `1' for R3
      tag and `2' for F3. Usually you may want to use short <out.prefix>
      to save diskspace. Long <out.prefix> also causes troubles to maq.

};

getopts('', \%opts);
die($usage) if (@ARGV != 2);
my ($title, $pre) = @ARGV;
my (@fhr, @fhw);
my @fn_suff = ('F3.csfasta', 'F3_QV.qual', 'R3.csfasta', 'R3_QV.qual');
my $is_paired = (-f "$title$fn_suff[2]" || -f "$title$fn_suff[2].gz")? 1 : 0;
if ($is_paired) { # paired end
  for (0 .. 3) {
        my $fn = "$title$fn_suff[$_]";
        $fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
        open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
  }
  open($fhw[0], "|gzip >$pre.read2.fastq.gz") || die; # this is NOT a typo
  open($fhw[1], "|gzip >$pre.read1.fastq.gz") || die;
  open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
  my (@df, @dr);
  @df = &read1(1); @dr = &read1(2);
  while (@df && @dr) {
        if ($df[0] eq $dr[0]) { # mate pair
          print {$fhw[0]} $df[1]; print {$fhw[1]} $dr[1];
          @df = &read1(1); @dr = &read1(2);
        } else {
          if ($df[0] le $dr[0]) {
                print {$fhw[2]} $df[1];
                @df = &read1(1);
          } else {
                print {$fhw[2]} $dr[1];
                @dr = &read1(2);
          }
        }
  }
  if (@df) {
        print {$fhw[2]} $df[1];
        while (@df = &read1(1, $fhr[0], $fhr[1])) {
          print {$fhw[2]} $df[1];
        }
  }
  if (@dr) {
        print {$fhw[2]} $dr[1];
        while (@dr = &read1(2, $fhr[2], $fhr[3])) {
          print {$fhw[2]} $dr[1];
        }
  }
  close($fhr[$_]) for (0 .. $#fhr);
  close($fhw[$_]) for (0 .. $#fhw);
} else { # single end
  for (0 .. 1) {
        my $fn = "$title$fn_suff[$_]";
        $fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
        open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
  }
  open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
  my @df;
  while (@df = &read1(1, $fhr[0], $fhr[1])) {
        print {$fhw[2]} $df[1];
  }
  close($fhr[$_]) for (0 .. $#fhr);
  close($fhw[2]);
}

sub read1 {
  my $i = shift(@_);
  my $j = ($i-1)<<1;
  my ($key, $seq);
  my ($fhs, $fhq) = ($fhr[$j], $fhr[$j|1]);
  while (<$fhs>) {
        my $t = <$fhq>;
        if (/^>[A-Za-z0-9.]+\s+(\d+)_(\d+)_(\d+)_[FR]3/) {
          $key = sprintf("%.4d_%.4d_%.4d", $1, $2, $3); # this line could be improved on 64-bit machines
          die(qq/** unmatched read name: '$_' != '$_'\n/) unless ($_ eq $t);
          my $name = "$pre:$1_$2_$3/$i";
          $_ = substr(<$fhs>, 2);
          tr/0123./ACGTN/;
          my $s = $_;
          $_ = <$fhq>;
          s/-1\b/0/eg;
          s/^(\d+)\s*//;
          s/(\d+)\s*/chr($1+33)/eg;
          $seq = qq/\@$name\n$s+\n$_\n/;
          last;
        }
  }
  return defined($seq)? ($key, $seq) : ();
}

Then just run:


solid2fastq.pl test_ test2

If everything went well in the end you should have a gzipped fastq file named test2.single.fastq.gz.

And please read the comments for this post, to see why this is a bad idea.