From SRA to FASTQ for SOLiD data

Before I forget how to do this (again), and waste time scouring the web, I’ll make it a post. The last time I checked, the SRA will not be continued, however, there still exists a large repository of data that one may wish to (re)analyse. And sometimes the sequencing data of interest was performed on the SOLiD platform meaning that the sequence information will be in colour space. Not all bioinformatic tools support colour space, so here’s how to convert the sequence information into the more widely used FASTQ format.

So firstly download the SRA files from your closest mirror, which in my case is DDBJ. Then download the SRA Toolkit. Then run:


abi-dump -A test SRR012345.lite.sra

Heng Li has written a Perl script that converts SOLiD csfasta files into fastq files. It is included in the scripts directory of the MAQ software package. I have included it here for convenience:


#!/usr/bin/perl

# Author: lh3
# Note: Ideally, this script should be written in C. It is a bit slow at present.
# Also note that this script is different from the one contained in MAQ.

use strict;
use warnings;
use diagnostics;
use Getopt::Std;

my %opts;
my $version = '0.1.4';
my $usage = qq{
Usage: solid2fastq.pl <in.title> <out.prefix>

Note: <in.title> is the string showed in the `# Title:' line of a
      ".csfasta" read file. Then <in.title>F3.csfasta is read sequence
      file and <in.title>F3_QV.qual is the quality file. If
      <in.title>R3.csfasta is present, this script assumes reads are
      paired; otherwise reads will be regarded as single-end.

      The read name will be <out.prefix>:panel_x_y/[12] with `1' for R3
      tag and `2' for F3. Usually you may want to use short <out.prefix>
      to save diskspace. Long <out.prefix> also causes troubles to maq.

};

getopts('', \%opts);
die($usage) if (@ARGV != 2);
my ($title, $pre) = @ARGV;
my (@fhr, @fhw);
my @fn_suff = ('F3.csfasta', 'F3_QV.qual', 'R3.csfasta', 'R3_QV.qual');
my $is_paired = (-f "$title$fn_suff[2]" || -f "$title$fn_suff[2].gz")? 1 : 0;
if ($is_paired) { # paired end
  for (0 .. 3) {
        my $fn = "$title$fn_suff[$_]";
        $fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
        open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
  }
  open($fhw[0], "|gzip >$pre.read2.fastq.gz") || die; # this is NOT a typo
  open($fhw[1], "|gzip >$pre.read1.fastq.gz") || die;
  open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
  my (@df, @dr);
  @df = &read1(1); @dr = &read1(2);
  while (@df && @dr) {
        if ($df[0] eq $dr[0]) { # mate pair
          print {$fhw[0]} $df[1]; print {$fhw[1]} $dr[1];
          @df = &read1(1); @dr = &read1(2);
        } else {
          if ($df[0] le $dr[0]) {
                print {$fhw[2]} $df[1];
                @df = &read1(1);
          } else {
                print {$fhw[2]} $dr[1];
                @dr = &read1(2);
          }
        }
  }
  if (@df) {
        print {$fhw[2]} $df[1];
        while (@df = &read1(1, $fhr[0], $fhr[1])) {
          print {$fhw[2]} $df[1];
        }
  }
  if (@dr) {
        print {$fhw[2]} $dr[1];
        while (@dr = &read1(2, $fhr[2], $fhr[3])) {
          print {$fhw[2]} $dr[1];
        }
  }
  close($fhr[$_]) for (0 .. $#fhr);
  close($fhw[$_]) for (0 .. $#fhw);
} else { # single end
  for (0 .. 1) {
        my $fn = "$title$fn_suff[$_]";
        $fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
        open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
  }
  open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
  my @df;
  while (@df = &read1(1, $fhr[0], $fhr[1])) {
        print {$fhw[2]} $df[1];
  }
  close($fhr[$_]) for (0 .. $#fhr);
  close($fhw[2]);
}

sub read1 {
  my $i = shift(@_);
  my $j = ($i-1)<<1;
  my ($key, $seq);
  my ($fhs, $fhq) = ($fhr[$j], $fhr[$j|1]);
  while (<$fhs>) {
        my $t = <$fhq>;
        if (/^>[A-Za-z0-9.]+\s+(\d+)_(\d+)_(\d+)_[FR]3/) {
          $key = sprintf("%.4d_%.4d_%.4d", $1, $2, $3); # this line could be improved on 64-bit machines
          die(qq/** unmatched read name: '$_' != '$_'\n/) unless ($_ eq $t);
          my $name = "$pre:$1_$2_$3/$i";
          $_ = substr(<$fhs>, 2);
          tr/0123./ACGTN/;
          my $s = $_;
          $_ = <$fhq>;
          s/-1\b/0/eg;
          s/^(\d+)\s*//;
          s/(\d+)\s*/chr($1+33)/eg;
          $seq = qq/\@$name\n$s+\n$_\n/;
          last;
        }
  }
  return defined($seq)? ($key, $seq) : ();
}

Then just run:


solid2fastq.pl test_ test2

If everything went well in the end you should have a gzipped fastq file named test2.single.fastq.gz.

And please read the comments for this post, to see why this is a bad idea.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
18 comments Add yours
  1. Dave, your not converting CS data into basespace data are you? I hope you meant that your converting CSfasta and CSqual into CSfastq…

    As you well know, it is not possible to convert csfasta data into basespace data… well it is, but its not even remotely a good idea.

    1. Hi Jason,

      Thanks for the comment! Great to get feedback from someone who is experienced with SOLiD data. Several people on SEQanswers have mentioned that it is a really bad idea to jump from colourspace to basespace. Here’s a really nice summary: http://seqanswers.com/forums/showpost.php?p=59156&postcount=4

      I guess the next thing I’ll do is compare the results from converting colourspace to basespace with the results of changing my reference into colourspace and mapping the SOLiD colourspace fastq to this reference.

      Cheers,

      Dave

      1. Sorry Dave, I (clearly erroneously) figured you already knew that you couldnt necessarily convert directly from CS to BS.

        a single error in the CS will completely change the BS output, as that seqanswers post shows. Even if it does map, its not what you sequenced. Many of the aligners handle CS data. I think that SHRiMP 2 is the best of them, but as I remmeber bwa has a cs mode as well.

        cheers

        1. Hi Jason,

          Don’t work much with SOLiD data 😉 I’ll go ahead and make the comparison anyway just to see how different (and bad) the results will be.

          I did play with SHRiMP before, and now with your recommendation, I’ll try the sequel.

          Thanks!

          Dave

          1. Hey Dave, Dave here, BFAST is good for aligning colourspace data. While it might be interesting to see the difference in converted basespace vs native colourspace alignments, any statistics you gain from this will be largely determined by sequence quality for the specific data set, and not generalisable to other data sets.

            1. Hey mate!

              Good to hear from you. Most papers I’ve seen (which is definitely not a lot), have used SHRiMP/2 to align their SOLiD data, which was why I went with that. If I have to work with colourspace data again, I’ll try BFAST. After reading a bit on SEQanswers, I realised converting to basespace was just a plain bad idea, so I didn’t go forward with comparing basespace to colourspace. Thanks for the comments and hope you’re doing well!

              Cheers,

              Dave

  2. While converting from ColourSpace to BaseSpace is a bad idea, converting from SOLid output files (.csfasta & .qual) into an integrated .fastq file is not the same thing. This should be clear to everyone reading this very informative thread. Converting the former two files into a single .csfastq file will simply mean that you will have both the sequence read and their qualities all in one, just as you do in a .fastq file, also with phred qualities instead of the number format used in the SOLid .qual. You do not necessarily need to convert form colourspace into basespace when doing this, albeit the script above does in fact do it. Having a single .csfastq file is necessary for some aligners which do not take both .qual and .csfasta files. Just thought it would be useful to clear that up for future readers. 🙂

  3. I am trying to use the below command but in vain…..I also used the above command mentioned by you but i seem to getting the same output error for both.

    any ADVICE thank you

    alok@alok-Vaio:~/sratoolkit.2.3.2-5-ubuntu64$ fastq-dump SRR030257.sra
    2013-07-08T15:59:12 fastq-dump.2.3.2 err: name not found while resolving tree within virtual file system module – failed to open ‘SRR030257.sra’
    Written 0 spots total

  4. Hi Dave,

    I was doing some analysis on a set of public data sets which were a mixture of ABI-Solid and Illumina data. Although I have analyzed Illumina data a number of times for RNA-seq a number of times and have the pipeline ready. I wanted to include the ABI-Solid data too. So I went ahead and used R package SRAdb to directly download the csfastq files for that dataset. When I analysed these with fastqc I got pretty bad quality plots as shown in the image posted as a link below.

    I have come across a lot of posts where people have recommended different tools like bfast and shrimp which do well with color space data. But If the data has adapter sequences that ought to be removed do Bfast/shrimp have the capability to neglect or remove them.

    I am attaching the fastqc plot for base qualities what do you think should be the strategy to do quality control in this case or cases similar to it.

    here is the image link :
    http://s8.postimg.org/gvi9rukqd/per_base_quality.png

    I had posted a question on biostars too about the same : http://www.biostars.org/p/78252/#78264

    1. Hi Saad,

      Yeah looking at your BioStar question and the replies, I think you got some very useful suggestions (much more than I would have been able to suggest!).

      BFAST is quite robust, since it uses multiple indexes, so perhaps it may be able to circumvent the poor sequencing.

      Also you could try an iterative approach of trimming and remapping.

      Cheers,

      Dave

  5. Can you tell me what parameters did you used for your .csfastq files when using Bfast. Also how was your data quality did it look as bad as mine ? I have heard people particularly when analyzing RNA-seq data do not use trimming and filtering but use certain quality criteria in the aligners to neglect low quality bases.

    1. I used the default BFAST paramaters but for reads in nucleotide space. Just try both approaches. Trimming is also used since it is commonly the case that the quality drops off towards the end of the read.

      1. The thing is I want to maintain uniformity. Since I’ll be using Tophat/cufflinks for illumina data it would be better for my analysis that I use similar algorithms/tools to to avoid any bias in my downstream analysis.

        Moreover I need to get results quickly and may not have the time to dabble around with different options.

        Which approach would you think might be better. As suggested in Biostars or to go ahead and use RNA-mate : iterative approach of mapping and trimming.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.