The article "A new coronavirus associated with human respiratory disease in China" released the full genome sequence (29,903 nt) of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In the paper, meta-transcriptomic sequencing was performed on bronchoalveolar lavage fluid (BALF) from a 41 year old male suffering from a severe respiratory disease. Contigs were assembled using MEGAHIT and the longest contig that was highly abundance, matched a bat SARS-like coronavirus (CoV) isolate (MG772933). RT-PCR and RACE were used to determine and confirm the genome sequence.
For this post, we will be using Entrez Direct, BLAST, and ClustalW. I created an environment.yml file that will install these tools into a Conda environment called sars_cov_2. This file and all other files used for this post are available on GitHub.
name: sars_cov_2 channels: - bioconda - anaconda - conda-forge - defaults dependencies: - entrez-direct - blast - clustalw
Create the environment and activate it.
conda env create --file environment.yml conda activate sars_cov_2
We can use efetch to retrieve the sequence deposited by Wu et al.
efetch -db sequences -format fasta -id MN908947 > raw/MN908947.fa head -3 raw/MN908947.fa >MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
I'm interested in what other coronaviruses SARS-CoV-2 matches. We can use esearch to perform a query on how many sequences are associated with the term "coronavirus".
esearch -db nuccore -query coronavirus <ENTREZ_DIRECT> <Db>nuccore</Db> <WebEnv>NCID_1_121026101_130.14.22.33_9001_1583054896_904858778_0MetA0_S_MegaStore</WebEnv> <QueryKey>1</QueryKey> <Count>41874</Count> <Step>1</Step> </ENTREZ_DIRECT>
As of 2020/03/01 there were 41874 results. Next we will pipe the esearch with efetch to retrieve these sequences.
esearch -db nuccore -query coronavirus | efetch -db sequences -format fasta > raw/coronavirus_20200301.fa
I analysed the FASTA sequence definition lines (for a later post) and while most of entries were clearly coronavirus related, some were not. This doesn't matter too much since their sequences will not match SARS-CoV-2.
Next, we will use these FASTA sequences to create a BLAST database.
mkdir db makeblastdb -dbtype nucl \ -in raw/coronavirus_20200301.fa \ -input_type fasta \ -title coronavirus_20200301 \ -out db/coronavirus_20200301 Building a new DB, current time: 03/01/2020 19:25:30 New DB name: /Users/dtang/github/sars_cov_2/db/coronavirus_20200301 New DB title: coronavirus_20200301 Sequence type: Nucleotide Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 41874 sequences in 9.91486 seconds.
Finally we'll BLAST our SARS-CoV-2 sequence to all the "coronavirus-related" sequences.
blastn -outfmt 7 -query raw/MN908947.fa -db db/coronavirus_20200301 > result/MN908947_blast.txt
I wrote a script to parse the BLAST results. The columns are the accession, alignment length, percentage match, number of mismatches, and the definition in the FASTA header.
script/parse_outfmt7.pl -i result/MN908947_blast.txt -p 80 -l 10000 -f raw/coronavirus_20200301.fa | head NC_045512.2 29903 100.000 0 Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome MN908947.3 29903 100.000 0 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome MT019531.1 29899 99.997 1 Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-03/2019, complete genome MN996528.1 29891 100.000 0 Severe acute respiratory syndrome coronavirus 2 isolate WIV04, complete genome MT019529.1 29899 99.990 3 Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-01/2019, complete genome MT019532.1 29890 100.000 0 Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-04/2019, complete genome MT049951.1 29903 99.983 5 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/Yunnan-01/human/2020/CHN, complete genome MN988668.1 29881 100.000 0 Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV WHU01, complete genome MN988669.1 29881 100.000 0 Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV WHU02, complete genome MT019533.1 29883 99.997 1 Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-05/2020, complete genome
The top BLAST results are matches to other isolated sequenced at different laboratories. If we remove results to "Severe acute respiratory syndrome coronavirus 2", we can see the same match to MG772933 (Bat SARS-like coronavirus isolate bat-SL-CoVZC45) with the same nucleotide identity of 89.1% as reported in Wu et al.
script/parse_outfmt7.pl -i result/MN908947_blast.txt -p 80 -l 10000 -f raw/coronavirus_20200301.fa | grep -v "Severe acute respiratory syndrome coronavirus 2" | head NC_045512.2 29903 100.000 0 Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome LR757996.1 29868 100.000 0 Wuhan seafood market pneumonia virus genome assembly, chromosome: whole_genome LR757995.1 29871 99.993 2 Wuhan seafood market pneumonia virus genome assembly, chromosome: whole_genome LR757998.1 29866 99.993 2 Wuhan seafood market pneumonia virus genome assembly, chromosome: whole_genome MN996532.1 29877 96.117 1136 Bat coronavirus RaTG13, complete genome MG772933.1 21729 89.116 2278 Bat SARS-like coronavirus isolate bat-SL-CoVZC45, complete genome MG772934.1 18272 88.655 2036 Bat SARS-like coronavirus isolate bat-SL-CoVZXC21, complete genome AY394996.1 17716 82.344 2941 SARS coronavirus ZS-B, complete genome AY394997.1 17716 82.344 2941 SARS coronavirus ZS-A, complete genome AY395003.1 17716 82.344 2941 SARS coronavirus ZS-C, complete genome
The article "A pneumonia outbreak associated with a new coronavirus of probable bat origin" discusses the close sequence match to Bat coronavirus RaTG13, which may support the hypothesis that SARS-CoV-2 spilled over from bats. Specifically, it is the receptor-binding domain of the spike protein that enables the viruses to enter cells via the ACE2 receptor; this is the second ORF from left to right in the annotation below.
We will use ClustalW to align MN908947 (SARS-CoV-2), MG772933 (bat-SL-CoVZC45), MG772934 (bat-SL-CoVZXC21), and MN996532 (RaTG13) to determine the specific sequence differences between these viruses.
# I manually created wanted.txt by pasting the accessions script/extract_fasta.pl -i raw/wanted.txt -f raw/coronavirus_20200301.fa > raw/wanted.fa clustalw -infile=raw/wanted.fa
Below is the multiple sequence alignment of the beginning of the spike protein.
MG772933.1 ATGTT-GTTTTTCTTGTTTCTTCAGTTCGCCTTAGTAAACTCCCAGTGTGTTAACTTGACAGGCAGAACCCCACTCAATCCCAATTATACT MG772934.1 ATGTT-GTTTTTCTTGTTTCTTCAGTTCGCCTTAGTAAACTCCCAGTGTG---ATTTGACAGGTAGAACTCCACTCAATCCCAATTATACT MN908947.3 ATGTTTGTTTTTCTTGTTTT----ATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGCATACACT MN996532.1 ATGTTTGTTTTTCTTGTTTT----ATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACTAGAACTCAGTTACCTCCTGCATACACC ***** ************* ** * **** ******* * * *** ***** * * ** ** ** MG772933.1 AATTCTTCACAAAGAGGTGTTTATTACCCTGACACAATTTATAGATCAGACACACTTGTG MG772934.1 AATTCTTCACAAAGAGGTGTTTATTACCCTGACACAATTTATAGATCTGACACACTAGTG MN908947.3 AATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCT-CAGTTTTACA MN996532.1 AACTCATCCACCCGTGGTGTCTATTACCCTGACAAAGTTTTCAGATCTT-CAGTTTTACA ** ** * * ***** ************* * *** ***** ** *
From the multiple sequence alignment, we can see that MN908947 and MN996532 are more closely related than MG772933 and MG772934 (same results as the BLAST report).
It would be interesting to see what differences there are between the spike proteins of MN996532 and MN908947. To this end, I manually extracted the nucleotide sequences from the alignment of the spike protein for MN996532 and MN908947. Next, I converted the nucleotide sequences into amino acids and aligned them with ClustalW. Below is the amino acid alignment.
CLUSTAL 2.1 multiple sequence alignment MN996532.1 MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSSTRGVYYPDKVFRSSVLHLTQDLFLPFFS MN908947.3 MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS ******************************* ***************** ********** MN996532.1 NVTWFHAIHVSGTNGIKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV MN908947.3 NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV *************** ******************************************** MN996532.1 NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE MN908947.3 NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE ************************************************************ MN996532.1 GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPPGFSALEPLVDLPIGINITRFQT MN908947.3 GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT ************************************* ********************** MN996532.1 LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK MN908947.3 LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK ************************************************************ MN996532.1 CTLKSFTVEKGIYQTSNFRVQPTDSIVRFPNITNLCPFGEVFNATTFASVYAWNRKRISN MN908947.3 CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN ***********************:********************* ************** MN996532.1 CVADYSVLYNSTSFSTFKCYGVSPTKLNDLCFTNVYADSFVITGDEVRQIAPGQTGKIAD MN908947.3 CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD ***********:****************************** ***************** MN996532.1 YNYKLPDDFTGCVIAWNSKHIDAKEGGNFNYLYRLFRKANLKPFERDISTEIYQAGSKPC MN908947.3 YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC ******************:::*:* ***:*********:******************.** MN996532.1 NGQTGLNCYYPLYRYGFYPTDGVGHQPYRVVVLSFELLNAPATVCGPKKSTNLVKNKCVN MN908947.3 NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN ** *:***:** *** **:***:*************:********************* MN996532.1 FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP MN908947.3 FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP ************************************************************ MN996532.1 GTNASNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY MN908947.3 GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY ***:******************************************************** MN996532.1 ECDIPIGAGICASYQTQTNS----RSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI MN908947.3 ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI ******************** ************************************ MN996532.1 SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE MN908947.3 SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE ************************************************************ MN996532.1 VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC MN908947.3 VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC ************************************************************ MN996532.1 LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM MN908947.3 LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM ************************************************************ MN996532.1 QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN MN908947.3 QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN ************************************************************ MN996532.1 TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA MN908947.3 TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA ************************************************************ MN996532.1 SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA MN908947.3 SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA ************************************************************ MN996532.1 ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGSCDVVIGIVNNTVYDP MN908947.3 ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDP ********************************************.*************** MN996532.1 LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL MN908947.3 LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL ************************************************************ MN996532.1 QELGKYEQYIKWPWYIWLGFIAGLIAIIMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD MN908947.3 QELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD ***************************:******************************** MN996532.1 SEPVLKGVKLHYT MN908947.3 SEPVLKGVKLHYT *************
More to come after I finish reading "Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation". For all code used in this post, please refer to my GitHub.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi Dave,
Can you give any assistance on a question I have. I’m just trying to figure out how, based on RNA sequencing of an entire bronchoavleolar lavage fluid sample, a virus with a partly novel genome is able to be, with a sufficient degree of certainty, ascertained. How is it known that the novel parts do not belong in fact to something else? Is there not a great deal of uncertainty here?
And wouldn’t it a basic part of the error-checking process, if a new coronavirus is suspected, to purify the viral particles of that size and density using a robust density centrifugation protocol, and then sequence the purified virus particles?
Hi John,
thanks for the question. I’m definitely not an expert on metagenome (or in this case metatranscriptome) assembly but here’s my basic understanding. Most assembly programs use a graph-based approach and in the case of the paper described in this post, MEGAHIT (https://www.ncbi.nlm.nih.gov/pubmed/25609793) was used. Each read is represented as a node (or vertex) and edges connect reads that have overlapping sequences. A read that has all but 1 nucleotide overlapping with another read most likely came from the same source and will be connected; an assembly is built as all reads are connected. Figure 3 of this paper (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000667) has a nice figure of this process and provides a better description of the assembly process.
Since there are some differences amongst all the viruses, different contigs will be assembled. The authors of the paper used RACE (https://en.wikipedia.org/wiki/Rapid_amplification_of_cDNA_ends) to verify the start and end of the SARS-CoV-2 genome.
Best,
Dave
Hi Dave,
Thanks for the info!
Hi Dave,
You are a legend! Thanks a million for the effort!
When I run `conda env create –file environment.yml` it gets stuck at “Solving Environment”. But if I use `mamba` it install everything and even instantaneously. Why do you think so?
Hi Asuman,
Mamba uses libsolv for solving dependencies, which is much faster than Conda’s implementation. So please use Mamba.
Cheers,
Dave