The Sequence Read Archive (SRA) is the largest publicly available repository of high throughput sequencing data. (Fun fact: it used to be called the Short Read Archive since most of the data was from short read sequencers.) The tool `fastq-dump` from the SRA Toolkit can be used to download SRA data. A while ago I learned of `fasterq-dump`, which as the name suggests is a faster version of `fastq-dump`. Below are the elapsed times for downloading SRR390728 using the two tools.
time fastq-dump SRR390728
# Read 7178576 spots for SRR390728
# Written 7178576 spots for SRR390728
#
# real 15m22.349s
# user 3m16.858s
# sys 0m22.203s
time fasterq-dump --split-files SRR390728
# spots read : 7,178,576
# reads read : 14,357,152
# reads written : 14,357,152
#
# real 7m3.119s
# user 2m12.225s
# sys 0m21.876s
There is also a tool called `prefetch`, but `fasterq-dump` performs the `prefetch` step and the FASTQ conversion in one go, so you do not need to run `prefetch` separately when using `fasterq-dump`.
I have been using `fasterq-dump` and it does the job, but it is still very slow and sometimes (more often than I would like) it crashes in the middle of a download.
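One common workaround for downloads that die part-way is to wrap the command in a retry loop. The following is only a minimal sketch: `retry` and the `RETRY_DELAY` variable are hypothetical names introduced here for illustration and are not part of the SRA Toolkit. It also assumes that simply re-running the command is safe, so you may need to clean up partial output between attempts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the SRA Toolkit): run a command up to
# max_attempts times, pausing RETRY_DELAY seconds between failed attempts.
retry() {
  local max_attempts=$1
  shift
  local attempt=1
  local delay=${RETRY_DELAY:-30}   # assumed default pause in seconds

  while true; do
    # Run the command exactly as given; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example invocation (left commented out because it needs network access):
# retry 3 fasterq-dump --split-files SRR390728
```

The function returns the success of the first attempt that works, or a non-zero status once all attempts are used up, so it can be chained with `&&` in a larger script.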
I looked around for a faster solution and came across parallel-fastq-dump, which cleverly splits a download up into independent blocks and downloads each block in parallel. However, the download simply hung when I tried to use it.
Finally I found out that AWS is hosting all the SRA data and has made it freely accessible from Amazon S3!
Downloading the same data set from AWS took only 30 seconds compared to over 7 minutes using `fasterq-dump`.
time aws s3 sync s3://sra-pub-run-odp/sra/SRR390728 SRR390728 --no-sign-request
# download: s3://sra-pub-run-odp/sra/SRR390728/SRR390728 to SRR390728/SRR390728
#
# real 0m29.429s
# user 0m2.701s
# sys 0m1.640s
The S3 bucket is in the `us-east-1` region, so if you are in the US and on the East Coast, you should have much faster download speeds than me (since I'm downloading from Japan).
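Note that the `aws s3 sync` command above fetches the run in SRA format (the downloaded object is `SRR390728/SRR390728`, not a FASTQ file), so a conversion step is still needed if you want FASTQ. The sketch below strings the two steps together for one accession. The helper names (`sra_s3_uri`, `run_cmd`, `fetch_run`), the `DRY_RUN` switch, and the `_fastq` output directory are all hypothetical, and it assumes that `fasterq-dump` will accept the path to the locally downloaded file in place of an accession.

```shell
#!/usr/bin/env bash
# Bucket prefix taken from the aws s3 sync command shown earlier.
SRA_BUCKET="s3://sra-pub-run-odp/sra"

# Hypothetical helper: print the S3 prefix for one run accession.
sra_s3_uri() {
  printf '%s/%s\n' "$SRA_BUCKET" "$1"
}

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: download one run from S3, then convert it locally.
# Assumption: fasterq-dump can read the downloaded file given its path.
fetch_run() {
  local acc=$1
  run_cmd aws s3 sync "$(sra_s3_uri "$acc")" "$acc" --no-sign-request &&
    run_cmd fasterq-dump --split-files --outdir "${acc}_fastq" "${acc}/${acc}"
}

# Example invocation (left commented out because it needs network access):
# fetch_run SRR390728
```

Setting `DRY_RUN=1` before calling `fetch_run` prints the two commands without running them, which is a cheap way to check the paths before starting a large download.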
I have a longer and more explanatory write up in my GitHub repo.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Hi Dave, thank you for your post. I have been using `ascp` to download SRA data up till now. However, it usually runs into problems! I will try this method in Japan to see whether it is also fast here.
Hi Hirakawa san,
thanks for the comment. `ascp` worked quite well for me (in Japan) and is definitely the fastest method; I could use it to download several terabytes of data without any problems. I have additional notes in my GitHub repository https://github.com/davetang/research_parasite.
I also just checked out your blog. Good to see another bioinformatics blog, especially one from Japan!
Cheers,
Dave
Hi Dave, thank you for your notes.
I’ve tried downloading from AWS (ERR4183340, 10 GB) but unfortunately it is not fast (<10 MB/s), even slower than wget from FTP (~15 MB/s).
It is also unfortunate that Aspera currently always runs into a session error when downloading from era-fasp@fasp.sra.ebi.ac.uk.
Perhaps I should wait a few days until fasp.sra.ebi.ac.uk recovers.
Hi Hirakawa san,
I could download ERR4183340 using Aspera.
I’ve never run into any problems so far when using Aspera.
Cheers,
Dave
Ah, I see. It seems there are some problems with the network on my server.