The article "A new coronavirus associated with human respiratory disease in China" released the full genome sequence (29,903 nt) of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In the paper, meta-transcriptomic sequencing was performed on bronchoalveolar lavage fluid (BALF) from a 41 year old male suffering from a severe respiratory disease. Contigs were assembled using MEGAHIT and the longest contig that was highly abundance, matched a bat SARS-like coronavirus (CoV) isolate (MG772933). RT-PCR and RACE were used to determine and confirm the genome sequence.
For this post, we will be using Entrez Direct, BLAST, and ClustalW. I created an environment.yml file that will install these tools into a Conda environment called sars_cov_2. This file and all other files used for this post are available on GitHub.
name: sars_cov_2 channels: - bioconda - anaconda - conda-forge - defaults dependencies: - entrez-direct - blast - clustalw
Create the environment and activate it.
conda env create --file environment.yml conda activate sars_cov_2
We can use efetch to retrieve the sequence deposited by Wu et al.
efetch -db sequences -format fasta -id MN908947 > raw/MN908947.fa head -3 raw/MN908947.fa >MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
I'm interested in what other coronaviruses SARS-CoV-2 matches. We can use esearch to perform a query on how many sequences are associated with the term "coronavirus".
esearch -db nuccore -query coronavirus <ENTREZ_DIRECT> <Db>nuccore</Db> <WebEnv>NCID_1_121026101_130.14.22.33_9001_1583054896_904858778_0MetA0_S_MegaStore</WebEnv> <QueryKey>1</QueryKey> <Count>41874</Count> <Step>1</Step> </ENTREZ_DIRECT>
As of 2020/03/01 there were 41874 results. Next we will pipe the esearch with efetch to retrieve these sequences.
esearch -db nuccore -query coronavirus | efetch -db sequences -format fasta > raw/coronavirus_20200301.fa
I analysed the FASTA sequence definition lines (for a later post) and while most of entries were clearly coronavirus related, some were not. This doesn't matter too much since their sequences will not match SARS-CoV-2.
Next, we will use these FASTA sequences to create a BLAST database.
mkdir db
makeblastdb -dbtype nucl \
-in raw/coronavirus_20200301.fa \
-input_type fasta \
-title coronavirus_20200301 \
-out db/coronavirus_20200301
Building a new DB, current time: 03/01/2020 19:25:30
New DB name: /Users/dtang/github/sars_cov_2/db/coronavirus_20200301
New DB title: coronavirus_20200301
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 41874 sequences in 9.91486 seconds.
Finally we'll BLAST our SARS-CoV-2 sequence to all the "coronavirus-related" sequences.
blastn -outfmt 7 -query raw/MN908947.fa -db db/coronavirus_20200301 > result/MN908947_blast.txt
I wrote a script to parse the BLAST results. The columns are the accession, alignment length, percentage match, number of mismatches, and the definition in the FASTA header.
script/parse_outfmt7.pl -i result/MN908947_blast.txt -p 80 -l 10000 -f raw/coronavirus_20200301.fa | head NC_045512.2 29903 100.000 0 Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome MN908947.3 29903 100.000 0 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome MT019531.1 29899 99.997 1 Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-03/2019, complete genome MN996528.1 29891 100.000 0 Severe acute respiratory syndrome coronavirus 2 isolate WIV04, complete genome MT019529.1 29899 99.990 3 Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-01/2019, complete genome MT019532.1 29890 100.000 0 Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-04/2019, complete genome MT049951.1 29903 99.983 5 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/Yunnan-01/human/2020/CHN, complete genome MN988668.1 29881 100.000 0 Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV WHU01, complete genome MN988669.1 29881 100.000 0 Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV WHU02, complete genome MT019533.1 29883 99.997 1 Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-05/2020, complete genome
The top BLAST results are matches to other isolated sequenced at different laboratories. If we remove results to "Severe acute respiratory syndrome coronavirus 2", we can see the same match to MG772933 (Bat SARS-like coronavirus isolate bat-SL-CoVZC45) with the same nucleotide identity of 89.1% as reported in Wu et al.
script/parse_outfmt7.pl -i result/MN908947_blast.txt -p 80 -l 10000 -f raw/coronavirus_20200301.fa | grep -v "Severe acute respiratory syndrome coronavirus 2" | head NC_045512.2 29903 100.000 0 Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome LR757996.1 29868 100.000 0 Wuhan seafood market pneumonia virus genome assembly, chromosome: whole_genome LR757995.1 29871 99.993 2 Wuhan seafood market pneumonia virus genome assembly, chromosome: whole_genome LR757998.1 29866 99.993 2 Wuhan seafood market pneumonia virus genome assembly, chromosome: whole_genome MN996532.1 29877 96.117 1136 Bat coronavirus RaTG13, complete genome MG772933.1 21729 89.116 2278 Bat SARS-like coronavirus isolate bat-SL-CoVZC45, complete genome MG772934.1 18272 88.655 2036 Bat SARS-like coronavirus isolate bat-SL-CoVZXC21, complete genome AY394996.1 17716 82.344 2941 SARS coronavirus ZS-B, complete genome AY394997.1 17716 82.344 2941 SARS coronavirus ZS-A, complete genome AY395003.1 17716 82.344 2941 SARS coronavirus ZS-C, complete genome
The article "A pneumonia outbreak associated with a new coronavirus of probable bat origin" discusses the close sequence match to Bat coronavirus RaTG13, which may support the hypothesis that SARS-CoV-2 spilled over from bats. Specifically, it is the receptor-binding domain of the spike protein that enables the viruses to enter cells via the ACE2 receptor; this is the second ORF from left to right in the annotation below.
We will use ClustalW to align MN908947 (SARS-CoV-2), MG772933 (bat-SL-CoVZC45), MG772934 (bat-SL-CoVZXC21), and MN996532 (RaTG13) to determine the specific sequence differences between these viruses.
# I manually created wanted.txt by pasting the accessions script/extract_fasta.pl -i raw/wanted.txt -f raw/coronavirus_20200301.fa > raw/wanted.fa clustalw -infile=raw/wanted.fa
Below is the multiple sequence alignment of the beginning of the spike protein.
MG772933.1 ATGTT-GTTTTTCTTGTTTCTTCAGTTCGCCTTAGTAAACTCCCAGTGTGTTAACTTGACAGGCAGAACCCCACTCAATCCCAATTATACT
MG772934.1 ATGTT-GTTTTTCTTGTTTCTTCAGTTCGCCTTAGTAAACTCCCAGTGTG---ATTTGACAGGTAGAACTCCACTCAATCCCAATTATACT
MN908947.3 ATGTTTGTTTTTCTTGTTTT----ATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGCATACACT
MN996532.1 ATGTTTGTTTTTCTTGTTTT----ATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACTAGAACTCAGTTACCTCCTGCATACACC
***** ************* ** * **** ******* * * *** ***** * * ** ** **
MG772933.1 AATTCTTCACAAAGAGGTGTTTATTACCCTGACACAATTTATAGATCAGACACACTTGTG
MG772934.1 AATTCTTCACAAAGAGGTGTTTATTACCCTGACACAATTTATAGATCTGACACACTAGTG
MN908947.3 AATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCT-CAGTTTTACA
MN996532.1 AACTCATCCACCCGTGGTGTCTATTACCCTGACAAAGTTTTCAGATCTT-CAGTTTTACA
** ** * * ***** ************* * *** ***** ** *
From the multiple sequence alignment, we can see that MN908947 and MN996532 are more closely related than MG772933 and MG772934 (same results as the BLAST report).
It would be interesting to see what differences there are between the spike proteins of MN996532 and MN908947. To this end, I manually extracted the nucleotide sequences from the alignment of the spike protein for MN996532 and MN908947. Next, I converted the nucleotide sequences into amino acids and aligned them with ClustalW. Below is the amino acid alignment.
CLUSTAL 2.1 multiple sequence alignment
MN996532.1 MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSSTRGVYYPDKVFRSSVLHLTQDLFLPFFS
MN908947.3 MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
******************************* ***************** **********
MN996532.1 NVTWFHAIHVSGTNGIKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
MN908947.3 NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
*************** ********************************************
MN996532.1 NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
MN908947.3 NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
************************************************************
MN996532.1 GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPPGFSALEPLVDLPIGINITRFQT
MN908947.3 GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
************************************* **********************
MN996532.1 LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
MN908947.3 LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
************************************************************
MN996532.1 CTLKSFTVEKGIYQTSNFRVQPTDSIVRFPNITNLCPFGEVFNATTFASVYAWNRKRISN
MN908947.3 CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
***********************:********************* **************
MN996532.1 CVADYSVLYNSTSFSTFKCYGVSPTKLNDLCFTNVYADSFVITGDEVRQIAPGQTGKIAD
MN908947.3 CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
***********:****************************** *****************
MN996532.1 YNYKLPDDFTGCVIAWNSKHIDAKEGGNFNYLYRLFRKANLKPFERDISTEIYQAGSKPC
MN908947.3 YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
******************:::*:* ***:*********:******************.**
MN996532.1 NGQTGLNCYYPLYRYGFYPTDGVGHQPYRVVVLSFELLNAPATVCGPKKSTNLVKNKCVN
MN908947.3 NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
** *:***:** *** **:***:*************:*********************
MN996532.1 FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
MN908947.3 FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
************************************************************
MN996532.1 GTNASNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
MN908947.3 GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
***:********************************************************
MN996532.1 ECDIPIGAGICASYQTQTNS----RSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
MN908947.3 ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
******************** ************************************
MN996532.1 SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
MN908947.3 SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
************************************************************
MN996532.1 VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
MN908947.3 VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
************************************************************
MN996532.1 LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
MN908947.3 LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
************************************************************
MN996532.1 QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
MN908947.3 QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
************************************************************
MN996532.1 TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA
MN908947.3 TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA
************************************************************
MN996532.1 SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA
MN908947.3 SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA
************************************************************
MN996532.1 ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGSCDVVIGIVNNTVYDP
MN908947.3 ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDP
********************************************.***************
MN996532.1 LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL
MN908947.3 LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL
************************************************************
MN996532.1 QELGKYEQYIKWPWYIWLGFIAGLIAIIMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
MN908947.3 QELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
***************************:********************************
MN996532.1 SEPVLKGVKLHYT
MN908947.3 SEPVLKGVKLHYT
*************
More to come after I finish reading "Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation". For all code used in this post, please refer to my GitHub.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

Hi Dave,
Can you give any assistance on a question I have. I’m just trying to figure out how, based on RNA sequencing of an entire bronchoavleolar lavage fluid sample, a virus with a partly novel genome is able to be, with a sufficient degree of certainty, ascertained. How is it known that the novel parts do not belong in fact to something else? Is there not a great deal of uncertainty here?
And wouldn’t it a basic part of the error-checking process, if a new coronavirus is suspected, to purify the viral particles of that size and density using a robust density centrifugation protocol, and then sequence the purified virus particles?
Hi John,
thanks for the question. I’m definitely not an expert on metagenome (or in this case metatranscriptome) assembly but here’s my basic understanding. Most assembly programs use a graph-based approach and in the case of the paper described in this post, MEGAHIT (https://www.ncbi.nlm.nih.gov/pubmed/25609793) was used. Each read is represented as a node (or vertex) and edges connect reads that have overlapping sequences. A read that has all but 1 nucleotide overlapping with another read most likely came from the same source and will be connected; an assembly is built as all reads are connected. Figure 3 of this paper (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000667) has a nice figure of this process and provides a better description of the assembly process.
Since there are some differences amongst all the viruses, different contigs will be assembled. The authors of the paper used RACE (https://en.wikipedia.org/wiki/Rapid_amplification_of_cDNA_ends) to verify the start and end of the SARS-CoV-2 genome.
Best,
Dave
Hi Dave,
Thanks for the info!
Hi Dave,
You are a legend! Thanks a million for the effort!
When I run `conda env create –file environment.yml` it gets stuck at “Solving Environment”. But if I use `mamba` it install everything and even instantaneously. Why do you think so?
Hi Asuman,
Mamba uses libsolv for solving dependencies, which is much faster than Conda’s implementation. So please use Mamba.
Cheers,
Dave