This post is a continuation of a series of posts on the sequence analysis of SARS-CoV-2; see part 1 and part 2 if you haven't already. Since my first post, I found out that you can blast sequences against a Betacoronavirus database on NCBI BLAST. The database, as of 2020/03/10, has a total of 7,844 sequences. I point this out because my first post included steps for creating your own "coronavirus" BLAST database and blasting against it. Running a query of "coronavirus" on NCBI's Nucleotide database now returns 42,102 results (up from 41,874 since 2020/03/01). As I pointed out before, many of these sequences are not related to coronaviruses but regardless there are now 228 additional sequences.
date Thu 12 Mar 2020 15:34:41 JST esearch -db nuccore -query coronavirus <ENTREZ_DIRECT> <Db>nuccore</Db> <WebEnv>NCID_1_153384575_130.14.18.97_9001_1583994493_2074949830_0MetA0_S_MegaStore</WebEnv> <QueryKey>1</QueryKey> <Count>42102</Count> <Step>1</Step> </ENTREZ_DIRECT>
SARS-CoV-2 has now infected 126,258 people and the WHO have just announced yesterday that COVID-19 is now a pandemic. Just today, the NBA season has been suspended due to Rudy Gobert testing positive to SARS-CoV-2.
NBA To Suspend Season Following Tonight's Games pic.twitter.com/2PTx2fkLlW
— NBA (@NBA) March 12, 2020
Anyway enough digression from the main topic of this post, which is the analysis of the raw reads from the Wu et al. study. The raw reads can be downloaded from the Short Read Archive. I had planned on using prefetch from the SRA Toolkit to download the sequences but I kept getting a timeout error; in the end, I just used wget.
wget https://sra-download.ncbi.nlm.nih.gov/traces/sra46/SRR/010714/SRR10971381
I have also uploaded this file to my web server, so you can download it from there if you are having problems downloading from the SRA.
wget -c -N https://davetang.org/file/SRR10971381 wget -c -N https://davetang.org/file/SRR10971381.md5sum # verify your download md5sum -c SRR10971381.md5sum SRR10971381: OK
Next we will use fastq-dump to create FASTQ files. Note that in the command below I have specified the location of the file "SRR10971381" (which I should have named something else); if you don't, fastq-dump will try to retrieve the SRA object from the SRA (and at least for me, I kept getting a timeout error) because SRR10971381 is the accession ID.
fastq-dump --split-files ./SRR10971381 Read 28282964 spots for ./SRR10971381 Written 28282964 spots for ./SRR10971381
Now we'll perform some preprocessing; we will use the all-in-one tool fastp to filter out poor quality reads and trim adapter sequences.
fastp --thread 8 -i SRR10971381_1.fastq -I SRR10971381_2.fastq -o SRR10971381_1_fastp.fastq -O SRR10971381_2_fastp.fastq Read1 before filtering: total reads: 28282964 total bases: 4017125680 Q20 bases: 1783384314(44.3945%) Q30 bases: 1735659038(43.2065%) Read2 before filtering: total reads: 28282964 total bases: 4013917534 Q20 bases: 1723725994(42.9437%) Q30 bases: 1652844944(41.1779%) Read1 after filtering: total reads: 13054241 total bases: 1786633510 Q20 bases: 1671652872(93.5644%) Q30 bases: 1634552420(91.4878%) Read2 aftering filtering: total reads: 13054241 total bases: 1782180210 Q20 bases: 1625652911(91.2171%) Q30 bases: 1568126467(87.9892%) Filtering result: reads passed filter: 26108482 reads failed due to low quality: 30441256 reads failed due to too many N: 16190 reads failed due to too short: 0 reads with adapter trimmed: 582728 bases trimmed due to adapters: 30162896 Duplication rate: 5.57505% Insert size peak (evaluated by paired-end reads): 150 JSON report: fastp.json HTML report: fastp.html fastp --thread 8 -i SRR10971381_1.fastq -I SRR10971381_2.fastq -o SRR10971381_1_fastp.fastq -O SRR10971381_2_fastp.fastq --thread 8 fastp v0.20.0, time used: 544 seconds
You should always inspect the quality of your reads and perform other quality control checks; for this we'll use FastQC.
mkdir fastqc_out fastqc -o fastqc_out -f fastq SRR10971381_1_fastp.fastq SRR10971381_2_fastp.fastq
The FastQC HTML reports will be available from my GitHub repo when I push my latest commits. The read qualities were acceptable but there were warnings for GC bias, which is commonly the case because FastQC expects random sequences and this is rarely the case.
The peak at 83% mean GC content was due to the high expression of a human long non-coding RNA (lncRNA) in the BALF, which has a GC content of 74.82%. I blasted this over-represented sequence (TCCCCGCCGGGCCTTCCCAGCCGTCCCGGAGCCGGTCGCGGCGCACCGCC) to find the match to the lncRNA.
Finally, to retrieve the SARS-CoV-2 sequences from the other sequences we will map the FASTQ files back to the assembled genome sequence (accession MN908947).
# create BWA index bwa index MN908947.fa # map and sort bwa mem -t 8 raw/bwa_index/MN908947.fa raw/SRR10971381/SRR10971381_1_fastp.fastq raw/SRR10971381/SRR10971381_2_fastp.fastq | samtools sort - -o result/SRR10971381_MN908947.bam
We'll use samtools flagstat to obtain some basic statistics on the mapping. We see below that 174,256 out of 26,137,357 reads mapped to MN908947. I will filter out supplementary and unmapped reads and only keep reads that were properly paired.
samtools flagstat -@8 SRR10971381_MN908947.bam 26137357 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 28875 + 0 supplementary 0 + 0 duplicates 174256 + 0 mapped (0.67% : N/A) 26108482 + 0 paired in sequencing 13054241 + 0 read1 13054241 + 0 read2 136034 + 0 properly paired (0.52% : N/A) 136414 + 0 with itself and mate mapped 8967 + 0 singletons (0.03% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) samtools view -F 0x804 -f 2 -b SRR10971381_MN908947.bam > SRR10971381_MN908947_mapped.bam
We can visualise this BAM file using IGV.
IGV screenshot of SRR10971381_MN908947_mapped.bam reads mapped onto MN908947. Some regions look like there is no coverage when zoomed out but each position is covered by at least 50 reads. The annotation track at the bottom of the screenshot (blue rectangles) is based on the GenBank annotation of MN908947.
If we zoom in a little we can see that the high expression of ORF6 skews the coverage axis.
I was interested in the distribution of mismatches (i.e. edit distances) in the read alignments. I used the Rsamtools package to load the edit distances into R and created a tally.
library(Rsamtools) # load edit distances and mapping qualities from BAM file my_param <- ScanBamParam(tag = "NM", what = c("mapq")) bam_list <- scanBam("~/github/sars_cov_2/result/SRR10971381_MN908947_mapped.bam", param = my_param) # mapping qualities summary(bam_list[[1]]$mapq) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00 60.00 60.00 56.16 60.00 60.00 # number of reads length(bam_list[[1]]$tag$NM) [1] 136034 # edit distances options("scipen"= 10) prop.table(table(bam_list[[1]]$tag$NM)) * 100 0 1 2 3 4 5 6 7 8 9 52.1024156 27.1468897 10.1143832 4.1254392 2.2082715 1.3084964 0.8453769 0.6189629 0.4579737 0.3138921 10 11 12 13 14 15 16 17 18 19 0.2345002 0.1580487 0.1073261 0.0735110 0.0558684 0.0352853 0.0323449 0.0147022 0.0117618 0.0102915 20 21 22 24 25 26 29 30 40 43 0.0029404 0.0029404 0.0044107 0.0022053 0.0014702 0.0007351 0.0014702 0.0022053 0.0007351 0.0007351 49 51 57 0.0007351 0.0029404 0.0007351
Over 50% of the reads matched the assembled genome perfectly and over 79% had zero or one mismatches. In my next post, I will examine raw reads from other SARS-CoV-2 studies.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi.
Thanks for explaining such a nice workflow.
A quick question.
/if we need the fasta file of the virus (for example for uploading on GISAID or NCBI), we can convert this final BAM file to Fastq and then convert it to Fasta file using samtools.
Since we have paired-end seq. The final format looks like this:
>SRR10971381.10000000/1
TTTCCGGCACCGGGCAGGTGTCAGGCTGTATACATCATCTTTCGAGTTAGCACAGCCCTGTGTTTTTGTTAAACAGTTGCCTGGNCCTATTCTCTGCGCCTCATATTGCTATGCCG
>SRR10971381.10000000/2
TTTCCGGCACCGGGCAGGTGTCAGGCNGTATACATCATNTTTCGAGTTAGCACAGCCCTGTGTTTTTGTTAAACAGTTGCCTGGACCTATTCTCTGCGCCTCATATTGCTATGAAT
>SRR10971381.10000005/1
GCCTGGACCTTCCGTATTACCGCGGCTGCTGGCACGGAATTAGCCGGTCCTTATTCGTATGGTACCTGCAAACAATCACACGTGANTGACTTTATCCCCATACAAANGCAGTTTACAACCCCGG
>SRR10971381.10000005/2
GCCTGGACCTTCCGTATTACCGCGGCTGCTGGCACGGAATTAGCCGGTCCTTATTCGTATGGTACCTGCAAACAATCACACGTGATTGACTTTATCCCCATACAAAAGCAGTTTACAACCC
…
But the typical shape of a FASTA file is something like this:
>hCoV-19/Yichun/JX2325/2020|EPI_ISL_455460|2020-02-09
TAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCT
GCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCT
….
Should we merge reads from each pair to prepare the final Fasta file?
Sorry, my question might look silly, but I really don’t know how to prepare normal Fasta file after discarding human reads.
Regards,
Zohreh
Hi Zohreh,
if you are uploading your sequence data to a repository, I would recommend uploading the FASTQ files since they should contain the base calling quality, which is useful for filtering out reads of poor quality.
If your reads are paired, typically there are two FASTQ files for each pair and each read pair is arranged in the same order in both FASTQ files.
If your repository insists on FASTA files, I guess you can interleave your reads into one final FASTA file.
Cheers,
Dave
Hello,
do you know how I could assemble the raw datasets without 100GB of ram?
I’d like to obtain a set of assembled contigs to try to perform different alignments.
I’ve tried with SPAdes but it requires too much resources.
Best regards and thank you for this informative blog.