TIL that you can download SRA data from AWS

The Sequence Read Archive (SRA) is the largest publicly available repository of high-throughput sequencing data. (Fun fact: it used to be called the Short Read Archive, since most of the data came from short-read sequencers.) The fastq-dump tool from the SRA Toolkit can be used to download SRA data. A while ago I learned of fasterq-dump, which, as the name suggests, is a faster version of fastq-dump. Below are the elapsed times for downloading SRR390728 with the two tools.

time fastq-dump SRR390728
# Read 7178576 spots for SRR390728
# Written 7178576 spots for SRR390728
# 
# real    15m22.349s
# user    3m16.858s
# sys     0m22.203s

time fasterq-dump --split-files SRR390728
# spots read      : 7,178,576
# reads read      : 14,357,152
# reads written   : 14,357,152
# 
# real    7m3.119s
# user    2m12.225s
# sys     0m21.876s
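
To put a number on the difference, the two "real" times above can be converted to seconds and divided. This is just a small helper sketch; the function names `to_seconds` and `speedup` are made up for this example, and it assumes the `XmY.ZZZs` format that bash's `time` prints.

```shell
# Convert a bash `time` value such as 15m22.349s to seconds.
to_seconds() {
  printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of two elapsed times (slower first, faster second).
speedup() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", a / b }'
}
```

Running `speedup 15m22.349s 7m3.119s` gives 2.18, so fasterq-dump was a bit more than twice as fast as fastq-dump for this run.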

There is also a tool called prefetch, but fasterq-dump performs the download and the FASTQ conversion in a single step, so you do not need to run prefetch before fasterq-dump.

I have been using fasterq-dump and it does the job, but it is very slow and it crashes in the middle of a download more often than I would like.
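
One way to live with the occasional crash is to wrap the download in a retry loop. The following is only a sketch: `retry` is a made-up helper (not part of the SRA Toolkit), and `RETRY_DELAY` is a hypothetical variable for the pause between attempts. It also assumes that simply re-running the command after a failure is safe, which may not hold if a crashed run leaves partial files behind.

```shell
# Run a command up to $1 times, pausing between failed attempts.
# Note: max and attempt are plain (global) shell variables.
retry() {
  max=$1
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "${RETRY_DELAY:-10}"   # RETRY_DELAY is an assumed, optional setting
  done
}
```

For example, `retry 3 fasterq-dump --split-files SRR390728` would try the download up to three times before giving up.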

I looked around for a faster solution and came across parallel-fastq-dump, which cleverly splits a download up into independent blocks and downloads each block in parallel. However, the download simply hung when I tried to use it.
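
The block-splitting idea can be sketched as follows. This is not parallel-fastq-dump's actual code; it is a minimal illustration that assumes each block is a contiguous range of spot IDs, which could then be handed to fastq-dump through its `-N` (minimum spot ID) and `-X` (maximum spot ID) options. The function name `spot_blocks` is made up for this example.

```shell
# Print "start end" spot ranges that split $1 spots into at most $2 blocks.
spot_blocks() {
  total=$1
  n=$2
  size=$(( (total + n - 1) / n ))   # ceiling division
  start=1
  while [ "$start" -le "$total" ]; do
    end=$(( start + size - 1 ))
    if [ "$end" -gt "$total" ]; then
      end=$total                    # last block may be shorter
    fi
    printf '%s %s\n' "$start" "$end"
    start=$(( end + 1 ))
  done
}
```

For the 7,178,576 spots of SRR390728 and four blocks, `spot_blocks 7178576 4` prints the ranges 1 to 1794644, 1794645 to 3589288, 3589289 to 5383932, and 5383933 to 7178576. Each range could then be dumped by a separate process and the outputs concatenated in order.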

Finally I found out that AWS is hosting all the SRA data and has made it freely accessible from Amazon S3!

Downloading the same data set from AWS took only 30 seconds, compared to over 7 minutes using fasterq-dump.

time aws s3 sync s3://sra-pub-run-odp/sra/SRR390728 SRR390728 --no-sign-request
# download: s3://sra-pub-run-odp/sra/SRR390728/SRR390728 to SRR390728/SRR390728
# 
# real    0m29.429s
# user    0m2.701s
# sys     0m1.640s
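
Note that the object fetched above is the run in SRA format (SRR390728/SRR390728), not FASTQ, so a conversion step is still needed afterwards. Below is a minimal sketch that wraps both steps. The function names `sra_s3_uri` and `fetch_from_aws` are made up for this example, the bucket layout is assumed to match the command above for other accessions, and pointing fasterq-dump at the downloaded file via a `./` path is an assumption about how your version of the SRA Toolkit resolves local files.

```shell
# Build the S3 prefix for a run accession (layout assumed from the command above).
sra_s3_uri() {
  printf 's3://sra-pub-run-odp/sra/%s\n' "$1"
}

# Download a run from S3, then convert the local SRA file to FASTQ.
fetch_from_aws() {
  acc=$1
  aws s3 sync "$(sra_s3_uri "$acc")" "$acc" --no-sign-request &&
    fasterq-dump --split-files "./$acc/$acc"   # assumed: local path is accepted
}
```

Usage would be `fetch_from_aws SRR390728`. The conversion runs on local disk, so it should not be limited by network speed, but it does add to the 30 seconds measured for the download alone.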

The S3 bucket is in the us-east-1 region, so if you are in the US and on the East Coast, you should have much faster download speeds than me (since I'm downloading from Japan).

I have a longer and more explanatory write up in my GitHub repo.

This work is licensed under a Creative Commons Attribution 4.0 International License.
5 comments
  1. Hi Dave, thank you for your post. I have been using ascp to download SRA data up till now. However, it usually runs into problems! I will try this method in Japan to see whether it is also fast here.

    1. Hi Hirakawa san,

      thanks for the comment. `ascp` worked quite well for me (in Japan) and is definitely the fastest method; I could use it to download several terabytes of data without any problems. I have additional notes in my GitHub repository https://github.com/davetang/research_parasite.

      I also just checked out your blog. Good to see another bioinformatics blog, especially one from Japan!

      Cheers,
      Dave

      1. Hi Dave, thank you for your notes.

        I tried downloading from AWS (ERR4183340, 10 GB), but unfortunately it is not fast (<10 MB/s); it is even slower than wget from FTP (~15 MB/s).

        Sadly, Aspera currently always runs into a session error when downloading from era-fasp@fasp.sra.ebi.ac.uk.

        Perhaps I should wait a few days until fasp.sra.ebi.ac.uk recovers.

        1. Hi Hirakawa san,

          I could download ERR4183340 using Aspera.

          docker run --rm -it -u parasite davetang/aspera_connect:4.2.5.306 /bin/bash
          cd
          
          time ascp -QT -l 300m -P33001 -i $HOME/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/ERR418/000/ERR4183340/ERR4183340_1.fastq.gz .
          ERR4183340_1.fastq.gz    100% 6092MB  125Mb/s    03:13
          Completed: 6238875K bytes transferred in 195 seconds
           (261238K bits/sec), in 1 file.
          
          real    3m22.633s
          user    0m21.617s
          sys     1m1.282s
          
          time ascp -QT -l 300m -P33001 -i $HOME/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/ERR418/000/ERR4183340/ERR4183340_2.fastq.gz .
          ERR4183340_2.fastq.gz    100% 6879MB  290Mb/s    04:58
          Completed: 7045040K bytes transferred in 299 seconds
           (192999K bits/sec), in 1 file.
          
          real    5m6.367s
          user    0m25.166s
          sys     1m8.711s
          

          I’ve never run into any problems so far when using Aspera.
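
The remote paths in the two ascp commands above follow a pattern, so they can be generated from the accession. The sketch below is an assumption-heavy illustration: `ena_fastq_path` is a made-up helper, and the directory rule (first six characters, then a zero-padded subdirectory taken from the digits beyond the ninth character) is inferred from the ERR4183340 path above and from how the ENA FTP layout is commonly described, so check it against a real listing before relying on it.

```shell
# Print the assumed ENA path for mate $2 (1 or 2) of run accession $1.
ena_fastq_path() {
  acc=$1
  mate=$2
  prefix=$(printf '%s' "$acc" | cut -c1-6)
  case ${#acc} in
    9)  sub="" ;;                                              # no extra subdirectory
    10) sub="00$(printf '%s' "$acc" | cut -c10)/" ;;           # e.g. 000/
    11) sub="0$(printf '%s' "$acc" | cut -c10-11)/" ;;         # e.g. 012/
    *)  echo "unsupported accession: $acc" >&2; return 1 ;;
  esac
  printf 'vol1/fastq/%s/%s%s/%s_%s.fastq.gz\n' \
    "$prefix" "$sub" "$acc" "$acc" "$mate"
}
```

With this, the first download above could be written as `ascp -QT -l 300m -P33001 -i $HOME/asperaweb_id_dsa.openssh "era-fasp@fasp.sra.ebi.ac.uk:$(ena_fastq_path ERR4183340 1)" .`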

          Cheers,
          Dave
