Sequence analysis of SARS-CoV-2 part 3

This post is a continuation of a series of posts on the sequence analysis of SARS-CoV-2; see part 1 and part 2 if you haven’t already. Since my first post, I found out that you can blast sequences against a Betacoronavirus database on NCBI BLAST. The database, as of 2020/03/10, has a total of 7,844 sequences. I point this out because my first post included steps for creating your own “coronavirus” BLAST database and blasting against it. Running a query of “coronavirus” on NCBI’s Nucleotide database now returns 42,102 results (up from 41,874 since 2020/03/01). As I pointed out before, many of these sequences are not related to coronaviruses but regardless there are now 228 additional sequences.

Thu 12 Mar 2020 15:34:41 JST

esearch -db nuccore -query coronavirus

SARS-CoV-2 has now infected 126,258 people and the WHO have just announced yesterday that COVID-19 is now a pandemic. Just today, the NBA season has been suspended due to Rudy Gobert testing positive to SARS-CoV-2.

Anyway enough digression from the main topic of this post, which is the analysis of the raw reads from the Wu et al. study. The raw reads can be downloaded from the Short Read Archive. I had planned on using prefetch from the SRA Toolkit to download the sequences but I kept getting a timeout error; in the end, I just used wget.


I have also uploaded this file to my web server, so you can download it from there if you are having problems downloading from the SRA.

wget -c -N  
wget -c -N 

# verify your download
md5sum -c SRR10971381.md5sum
SRR10971381: OK

Next we will use fastq-dump to create FASTQ files. Note that in the command below I have specified the location of the file “SRR10971381” (which I should have named something else); if you don’t, fastq-dump will try to retrieve the SRA object from the SRA (and at least for me, I kept getting a timeout error) because SRR10971381 is the accession ID.

fastq-dump --split-files ./SRR10971381
Read 28282964 spots for ./SRR10971381
Written 28282964 spots for ./SRR10971381

Now we’ll perform some preprocessing; we will use the all-in-one tool fastp to filter out poor quality reads and trim adapter sequences.

fastp --thread 8 -i SRR10971381_1.fastq -I SRR10971381_2.fastq -o SRR10971381_1_fastp.fastq -O SRR10971381_2_fastp.fastq
Read1 before filtering:
total reads: 28282964
total bases: 4017125680
Q20 bases: 1783384314(44.3945%)
Q30 bases: 1735659038(43.2065%)

Read2 before filtering:
total reads: 28282964
total bases: 4013917534
Q20 bases: 1723725994(42.9437%)
Q30 bases: 1652844944(41.1779%)

Read1 after filtering:
total reads: 13054241
total bases: 1786633510
Q20 bases: 1671652872(93.5644%)
Q30 bases: 1634552420(91.4878%)

Read2 aftering filtering:
total reads: 13054241
total bases: 1782180210
Q20 bases: 1625652911(91.2171%)
Q30 bases: 1568126467(87.9892%)

Filtering result:
reads passed filter: 26108482
reads failed due to low quality: 30441256
reads failed due to too many N: 16190
reads failed due to too short: 0
reads with adapter trimmed: 582728
bases trimmed due to adapters: 30162896

Duplication rate: 5.57505%

Insert size peak (evaluated by paired-end reads): 150 

JSON report: fastp.json
HTML report: fastp.html

fastp --thread 8 -i SRR10971381_1.fastq -I SRR10971381_2.fastq -o SRR10971381_1_fastp.fastq -O SRR10971381_2_fastp.fastq --thread 8
fastp v0.20.0, time used: 544 seconds

You should always inspect the quality of your reads and perform other quality control checks; for this we’ll use FastQC.

mkdir fastqc_out
fastqc -o fastqc_out -f fastq SRR10971381_1_fastp.fastq SRR10971381_2_fastp.fastq

The FastQC HTML reports will be available from my GitHub repo when I push my latest commits. The read qualities were acceptable but there were warnings for GC bias, which is commonly the case because FastQC expects random sequences and this is rarely the case.

The peak at 83% mean GC content was due to the high expression of a human long non-coding RNA (lncRNA) in the BALF, which has a GC content of 74.82%. I blasted this over-represented sequence (TCCCCGCCGGGCCTTCCCAGCCGTCCCGGAGCCGGTCGCGGCGCACCGCC) to find the match to the lncRNA.

Finally, to retrieve the SARS-CoV-2 sequences from the other sequences we will map the FASTQ files back to the assembled genome sequence (accession MN908947).

# create BWA index
bwa index MN908947.fa

# map and sort
bwa mem -t 8 raw/bwa_index/MN908947.fa raw/SRR10971381/SRR10971381_1_fastp.fastq raw/SRR10971381/SRR10971381_2_fastp.fastq | samtools sort - -o result/SRR10971381_MN908947.bam

We’ll use samtools flagstat to obtain some basic statistics on the mapping. We see below that 174,256 out of 26,137,357 reads mapped to MN908947. I will filter out supplementary and unmapped reads and only keep reads that were properly paired.

samtools flagstat -@8 SRR10971381_MN908947.bam
26137357 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
28875 + 0 supplementary
0 + 0 duplicates
174256 + 0 mapped (0.67% : N/A)
26108482 + 0 paired in sequencing
13054241 + 0 read1
13054241 + 0 read2
136034 + 0 properly paired (0.52% : N/A)
136414 + 0 with itself and mate mapped
8967 + 0 singletons (0.03% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools view -F 0x804 -f 2 -b SRR10971381_MN908947.bam > SRR10971381_MN908947_mapped.bam

We can visualise this BAM file using IGV.

IGV screenshot of SRR10971381_MN908947_mapped.bam reads mapped onto MN908947. Some regions look like there is no coverage when zoomed out but each position is covered by at least 50 reads. The annotation track at the bottom of the screenshot (blue rectangles) is based on the GenBank annotation of MN908947.

If we zoom in a little we can see that the high expression of ORF6 skews the coverage axis.

I was interested in the distribution of mismatches (i.e. edit distances) in the read alignments. I used the Rsamtools package to load the edit distances into R and created a tally.


# load edit distances and mapping qualities from BAM file
my_param <- ScanBamParam(tag = "NM", what = c("mapq"))
bam_list <- scanBam("~/github/sars_cov_2/result/SRR10971381_MN908947_mapped.bam", param = my_param)

# mapping qualities
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   60.00   60.00   56.16   60.00   60.00

# number of reads
[1] 136034

# edit distances
options("scipen"= 10)
prop.table(table(bam_list[[1]]$tag$NM)) * 100
         0          1          2          3          4          5          6          7          8          9 
52.1024156 27.1468897 10.1143832  4.1254392  2.2082715  1.3084964  0.8453769  0.6189629  0.4579737  0.3138921 
        10         11         12         13         14         15         16         17         18         19 
 0.2345002  0.1580487  0.1073261  0.0735110  0.0558684  0.0352853  0.0323449  0.0147022  0.0117618  0.0102915 
        20         21         22         24         25         26         29         30         40         43 
 0.0029404  0.0029404  0.0044107  0.0022053  0.0014702  0.0007351  0.0014702  0.0022053  0.0007351  0.0007351 
        49         51         57 
 0.0007351  0.0029404  0.0007351

Over 50% of the reads matched the assembled genome perfectly and over 79% had zero or one mismatches. In my next post, I will examine raw reads from other SARS-CoV-2 studies.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
2 comments Add yours
  1. Hi.
    Thanks for explaining such a nice workflow.
    A quick question.
    /if we need the fasta file of the virus (for example for uploading on GISAID or NCBI), we can convert this final BAM file to Fastq and then convert it to Fasta file using samtools.
    Since we have paired-end seq. The final format looks like this:

    But the typical shape of a FASTA file is something like this:

    Should we merge reads from each pair to prepare the final Fasta file?
    Sorry, my question might look silly, but I really don’t know how to prepare normal Fasta file after discarding human reads.

    1. Hi Zohreh,

      if you are uploading your sequence data to a repository, I would recommend uploading the FASTQ files since they should contain the base calling quality, which is useful for filtering out reads of poor quality.

      If your reads are paired, typically there are two FASTQ files for each pair and each read pair is arranged in the same order in both FASTQ files.

      If your repository insists on FASTA files, I guess you can interleave your reads into one final FASTA file.


Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.