Sequence analysis of SARS-CoV-2

The article "A new coronavirus associated with human respiratory disease in China" released the full genome sequence (29,903 nt) of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In the paper, meta-transcriptomic sequencing was performed on bronchoalveolar lavage fluid (BALF) from a 41 year old male suffering from a severe respiratory disease. Contigs were assembled using MEGAHIT and the longest contig that was highly abundance, matched a bat SARS-like coronavirus (CoV) isolate (MG772933). RT-PCR and RACE were used to determine and confirm the genome sequence.

For this post, we will be using Entrez Direct, BLAST, and ClustalW. I created an environment.yml file that will install these tools into a Conda environment called sars_cov_2. This file and all other files used for this post are available on GitHub.

name: sars_cov_2
channels:
  - bioconda
  - anaconda
  - conda-forge
  - defaults
dependencies:
  - entrez-direct
  - blast
  - clustalw

Create the environment and activate it.

conda env create --file environment.yml

conda activate sars_cov_2

We can use efetch to retrieve the sequence deposited by Wu et al.

efetch -db sequences -format fasta -id MN908947 > raw/MN908947.fa

head -3 raw/MN908947.fa 
>MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC

I'm interested in what other coronaviruses SARS-CoV-2 matches. We can use esearch to perform a query on how many sequences are associated with the term "coronavirus".

esearch -db nuccore -query coronavirus
<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>NCID_1_121026101_130.14.22.33_9001_1583054896_904858778_0MetA0_S_MegaStore</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>41874</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

As of 2020/03/01 there were 41874 results. Next we will pipe the esearch with efetch to retrieve these sequences.

esearch -db nuccore -query coronavirus |
efetch -db sequences -format fasta > raw/coronavirus_20200301.fa

I analysed the FASTA sequence definition lines (for a later post) and while most of entries were clearly coronavirus related, some were not. This doesn't matter too much since their sequences will not match SARS-CoV-2.

Next, we will use these FASTA sequences to create a BLAST database.

mkdir db

makeblastdb -dbtype nucl \
            -in raw/coronavirus_20200301.fa \
            -input_type fasta \
            -title coronavirus_20200301 \
            -out db/coronavirus_20200301

Building a new DB, current time: 03/01/2020 19:25:30
New DB name:   /Users/dtang/github/sars_cov_2/db/coronavirus_20200301
New DB title:  coronavirus_20200301
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 41874 sequences in 9.91486 seconds.

Finally we'll BLAST our SARS-CoV-2 sequence to all the "coronavirus-related" sequences.

blastn -outfmt 7 -query raw/MN908947.fa -db db/coronavirus_20200301 > result/MN908947_blast.txt

I wrote a script to parse the BLAST results. The columns are the accession, alignment length, percentage match, number of mismatches, and the definition in the FASTA header.

script/parse_outfmt7.pl -i result/MN908947_blast.txt -p 80 -l 10000 -f raw/coronavirus_20200301.fa | head
NC_045512.2     29903   100.000 0       Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome
MN908947.3      29903   100.000 0       Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
MT019531.1      29899   99.997  1       Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-03/2019, complete genome
MN996528.1      29891   100.000 0       Severe acute respiratory syndrome coronavirus 2 isolate WIV04, complete genome
MT019529.1      29899   99.990  3       Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-01/2019, complete genome
MT019532.1      29890   100.000 0       Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-04/2019, complete genome
MT049951.1      29903   99.983  5       Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/Yunnan-01/human/2020/CHN, complete genome
MN988668.1      29881   100.000 0       Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV WHU01, complete genome
MN988669.1      29881   100.000 0       Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV WHU02, complete genome
MT019533.1      29883   99.997  1       Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-05/2020, complete genome

The top BLAST results are matches to other isolated sequenced at different laboratories. If we remove results to "Severe acute respiratory syndrome coronavirus 2", we can see the same match to MG772933 (Bat SARS-like coronavirus isolate bat-SL-CoVZC45) with the same nucleotide identity of 89.1% as reported in Wu et al.

script/parse_outfmt7.pl -i result/MN908947_blast.txt -p 80 -l 10000 -f raw/coronavirus_20200301.fa | grep -v "Severe acute respiratory syndrome coronavirus 2" | head
NC_045512.2     29903   100.000 0       Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome
LR757996.1      29868   100.000 0       Wuhan seafood market pneumonia virus genome assembly, chromosome: whole_genome
LR757995.1      29871   99.993  2       Wuhan seafood market pneumonia virus genome assembly, chromosome: whole_genome
LR757998.1      29866   99.993  2       Wuhan seafood market pneumonia virus genome assembly, chromosome: whole_genome
MN996532.1      29877   96.117  1136    Bat coronavirus RaTG13, complete genome
MG772933.1      21729   89.116  2278    Bat SARS-like coronavirus isolate bat-SL-CoVZC45, complete genome
MG772934.1      18272   88.655  2036    Bat SARS-like coronavirus isolate bat-SL-CoVZXC21, complete genome
AY394996.1      17716   82.344  2941    SARS coronavirus ZS-B, complete genome
AY394997.1      17716   82.344  2941    SARS coronavirus ZS-A, complete genome
AY395003.1      17716   82.344  2941    SARS coronavirus ZS-C, complete genome

The article "A pneumonia outbreak associated with a new coronavirus of probable bat origin" discusses the close sequence match to Bat coronavirus RaTG13, which may support the hypothesis that SARS-CoV-2 spilled over from bats. Specifically, it is the receptor-binding domain of the spike protein that enables the viruses to enter cells via the ACE2 receptor; this is the second ORF from left to right in the annotation below.

We will use ClustalW to align MN908947 (SARS-CoV-2), MG772933 (bat-SL-CoVZC45), MG772934 (bat-SL-CoVZXC21), and MN996532 (RaTG13) to determine the specific sequence differences between these viruses.

# I manually created wanted.txt by pasting the accessions
script/extract_fasta.pl -i raw/wanted.txt -f raw/coronavirus_20200301.fa > raw/wanted.fa
clustalw -infile=raw/wanted.fa

Below is the multiple sequence alignment of the beginning of the spike protein.

MG772933.1      ATGTT-GTTTTTCTTGTTTCTTCAGTTCGCCTTAGTAAACTCCCAGTGTGTTAACTTGACAGGCAGAACCCCACTCAATCCCAATTATACT
MG772934.1      ATGTT-GTTTTTCTTGTTTCTTCAGTTCGCCTTAGTAAACTCCCAGTGTG---ATTTGACAGGTAGAACTCCACTCAATCCCAATTATACT
MN908947.3      ATGTTTGTTTTTCTTGTTTT----ATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGCATACACT
MN996532.1      ATGTTTGTTTTTCTTGTTTT----ATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACTAGAACTCAGTTACCTCCTGCATACACC
                ***** *************      **  *  ****       *******   *  * ***   ***** *   *    **    ** **  

MG772933.1      AATTCTTCACAAAGAGGTGTTTATTACCCTGACACAATTTATAGATCAGACACACTTGTG
MG772934.1      AATTCTTCACAAAGAGGTGTTTATTACCCTGACACAATTTATAGATCTGACACACTAGTG
MN908947.3      AATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCT-CAGTTTTACA
MN996532.1      AACTCATCCACCCGTGGTGTCTATTACCCTGACAAAGTTTTCAGATCTT-CAGTTTTACA
                ** ** *      * ***** ************* * ***  *****   **   *

From the multiple sequence alignment, we can see that MN908947 and MN996532 are more closely related than MG772933 and MG772934 (same results as the BLAST report).

It would be interesting to see what differences there are between the spike proteins of MN996532 and MN908947. To this end, I manually extracted the nucleotide sequences from the alignment of the spike protein for MN996532 and MN908947. Next, I converted the nucleotide sequences into amino acids and aligned them with ClustalW. Below is the amino acid alignment.

CLUSTAL 2.1 multiple sequence alignment


MN996532.1      MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSSTRGVYYPDKVFRSSVLHLTQDLFLPFFS
MN908947.3      MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
                ******************************* ***************** **********

MN996532.1      NVTWFHAIHVSGTNGIKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
MN908947.3      NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
                *************** ********************************************

MN996532.1      NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
MN908947.3      NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
                ************************************************************

MN996532.1      GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPPGFSALEPLVDLPIGINITRFQT
MN908947.3      GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
                ************************************* **********************

MN996532.1      LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
MN908947.3      LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
                ************************************************************

MN996532.1      CTLKSFTVEKGIYQTSNFRVQPTDSIVRFPNITNLCPFGEVFNATTFASVYAWNRKRISN
MN908947.3      CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
                ***********************:********************* **************

MN996532.1      CVADYSVLYNSTSFSTFKCYGVSPTKLNDLCFTNVYADSFVITGDEVRQIAPGQTGKIAD
MN908947.3      CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
                ***********:****************************** *****************

MN996532.1      YNYKLPDDFTGCVIAWNSKHIDAKEGGNFNYLYRLFRKANLKPFERDISTEIYQAGSKPC
MN908947.3      YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
                ******************:::*:* ***:*********:******************.**

MN996532.1      NGQTGLNCYYPLYRYGFYPTDGVGHQPYRVVVLSFELLNAPATVCGPKKSTNLVKNKCVN
MN908947.3      NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
                **  *:***:**  *** **:***:*************:*********************

MN996532.1      FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
MN908947.3      FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
                ************************************************************

MN996532.1      GTNASNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
MN908947.3      GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
                ***:********************************************************

MN996532.1      ECDIPIGAGICASYQTQTNS----RSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
MN908947.3      ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
                ********************    ************************************

MN996532.1      SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
MN908947.3      SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
                ************************************************************

MN996532.1      VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
MN908947.3      VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
                ************************************************************

MN996532.1      LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
MN908947.3      LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
                ************************************************************

MN996532.1      QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
MN908947.3      QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
                ************************************************************

MN996532.1      TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA
MN908947.3      TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA
                ************************************************************

MN996532.1      SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA
MN908947.3      SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA
                ************************************************************

MN996532.1      ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGSCDVVIGIVNNTVYDP
MN908947.3      ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDP
                ********************************************.***************

MN996532.1      LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL
MN908947.3      LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL
                ************************************************************

MN996532.1      QELGKYEQYIKWPWYIWLGFIAGLIAIIMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
MN908947.3      QELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
                ***************************:********************************

MN996532.1      SEPVLKGVKLHYT
MN908947.3      SEPVLKGVKLHYT
                *************

More to come after I finish reading "Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation". For all code used in this post, please refer to my GitHub.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

5 comments Add yours

John P says:

April 18, 2020 at 00:08

Hi Dave,

Can you give any assistance on a question I have. I’m just trying to figure out how, based on RNA sequencing of an entire bronchoavleolar lavage fluid sample, a virus with a partly novel genome is able to be, with a sufficient degree of certainty, ascertained. How is it known that the novel parts do not belong in fact to something else? Is there not a great deal of uncertainty here?

And wouldn’t it a basic part of the error-checking process, if a new coronavirus is suspected, to purify the viral particles of that size and density using a robust density centrifugation protocol, and then sequence the purified virus particles?

1. Davo says:
  
  April 18, 2020 at 00:33
  
  Hi John,
  
  thanks for the question. I’m definitely not an expert on metagenome (or in this case metatranscriptome) assembly but here’s my basic understanding. Most assembly programs use a graph-based approach and in the case of the paper described in this post, MEGAHIT (https://www.ncbi.nlm.nih.gov/pubmed/25609793) was used. Each read is represented as a node (or vertex) and edges connect reads that have overlapping sequences. A read that has all but 1 nucleotide overlapping with another read most likely came from the same source and will be connected; an assembly is built as all reads are connected. Figure 3 of this paper (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000667) has a nice figure of this process and provides a better description of the assembly process.
  
  Since there are some differences amongst all the viruses, different contigs will be assembled. The authors of the paper used RACE (https://en.wikipedia.org/wiki/Rapid_amplification_of_cDNA_ends) to verify the start and end of the SARS-CoV-2 genome.
  
  Best,
  Dave
  
  1. John Pospichal says:
    
    April 18, 2020 at 05:02
    
    Hi Dave,
    
    Thanks for the info!
    
Asuman I says:

February 15, 2022 at 18:06

Hi Dave,

You are a legend! Thanks a million for the effort!

When I run `conda env create –file environment.yml` it gets stuck at “Solving Environment”. But if I use `mamba` it install everything and even instantaneously. Why do you think so?

1. Davo says:
  
  February 15, 2022 at 23:37
  
  Hi Asuman,
  
  Mamba uses libsolv for solving dependencies, which is much faster than Conda’s implementation. So please use Mamba.
  
  Cheers,
  Dave

Leave a Reply Cancel reply