In this post, I describe a simple workflow for identifying Omicron variants from sequencing data shared by the KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP) and the Centre for Epidemic Response and Innovation (CERI). To follow the workflow, you need Docker installed and an Internet connection. To get started, run a container using davetang/build:1.1, which is an image I prepared for carrying out bioinformatic analyses. If you need some background on Docker, I have some notes on GitHub.
docker run --rm -it -v $(pwd):$(pwd) -w $(pwd) davetang/build:1.1 /bin/bash

# run these commands once you are inside the container
mkdir -p tools/bin
export PATH=$(pwd)/tools/bin:$(pwd)/src/edirect:$(pwd)/src/sratoolkit.2.11.3-ubuntu64/bin/:$PATH

# run this ONLY if you have already installed the SRA Toolkit
# press x when you see the config screen
vdb-config --interactive
The docker run command mounts your current directory inside the Docker container and starts you in that same directory. All of the commands in this post save data to this directory, so your data will persist on your local computer after you exit the container. In addition, we will install all the tools inside the tools directory, so make sure you also run the export PATH command inside the Docker container as shown above.
If you have exited the container and want to resume your analysis in a new container, make sure you run the commands above in the same directory where you first ran the docker run command.
Tools
We need various tools for the workflow; you should be able to simply copy and paste the commands below to install them from inside the Docker container. Each tool will be downloaded into the src directory.
Set up Entrez Direct.
mkdir src && cd src
url=ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect
wget ${url}/edirect.tar.gz
tar -xzf edirect.tar.gz
cd edirect

plt=Linux
for exc in xtract rchive transmute; do
  wget ${url}/${exc}.${plt}.gz
  gunzip ${exc}.${plt}.gz
  chmod 755 ${exc}.${plt}
done
cd ..
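To quickly check that Entrez Direct is working (an optional sanity check; this queries the same BioProject used later in the post):

# should print the runinfo header line followed by the first run
esearch -db sra -query PRJNA784038 | efetch -format runinfo | head -2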
Set up the SRA Toolkit and datasets. For the SRA Toolkit, we need to run vdb-config before we can use it; when you see the setup screen, just press the letter "x" on your keyboard (without the double quotes).
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.11.3/sratoolkit.2.11.3-ubuntu64.tar.gz
tar -xzf sratoolkit.2.11.3-ubuntu64.tar.gz

wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
chmod 755 datasets && mv datasets ../tools/bin

# press x
vdb-config --interactive
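If everything went well, both tools should now be found on your PATH; an optional check:

# confirm the tools can be found and print their versions
fasterq-dump --version
datasets --version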
Next up, we need SnpEff for annotating our variants. After downloading the tool, we will download the reference sequence for SARS-CoV-2 (NC_045512.2).
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
cd snpEff
java -jar snpEff.jar download NC_045512.2
cd ..
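To confirm that the SARS-CoV-2 database was downloaded (an optional check; by default SnpEff stores databases under its data directory):

# the database directory should contain the predictor files
ls snpEff/data/NC_045512.2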
We will use vt to decompose our variants, i.e. display multiallelic variants on multiple lines instead of one.
git clone https://github.com/atks/vt.git
cd vt
git submodule update --init --recursive
make
make test
mv vt ../../tools/bin
cd ..
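To make the decomposition concrete, here is a toy example (made up for illustration, not taken from the real data): a multiallelic record such as

NC_045512.2  21995  .  C  A,T  .  .  .

is decomposed by vt into two biallelic records:

NC_045512.2  21995  .  C  A  .  .  .
NC_045512.2  21995  .  C  T  .  .  .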
We will use BWA for aligning the reads.
wget https://github.com/lh3/bwa/archive/refs/tags/v0.7.17.tar.gz
tar -xzf v0.7.17.tar.gz
cd bwa-0.7.17 && make && mv bwa ../../tools/bin
cd ..
Finally, we need BCFtools, SAMtools, and HTSlib, which are staples of bioinformatic high-throughput sequence analysis.
ver=1.14
dir=$(pwd)/../tools
for tool in htslib bcftools samtools; do
  url=https://github.com/samtools/${tool}/releases/download/${ver}/${tool}-${ver}.tar.bz2
  wget ${url}
  tar xjf ${tool}-${ver}.tar.bz2
  cd ${tool}-${ver}
  ./configure --prefix=${dir}
  make && make install
  cd ..
done
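Once the loop finishes, the tools will be in tools/bin, which is already on your PATH; an optional check:

# print the version of each tool to confirm the installation
samtools --version | head -1
bcftools --version | head -1
tabix --version | head -1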
Download data
We need a reference sequence to align our reads to and call variants against. GCF_009858895.2 is a RefSeq assembly, which means it is maintained by NCBI; this assembly is based on NC_045512.2.
Before running the following commands, move back to the working directory (i.e. navigate one directory above src), if you haven't already; this is the directory you were initially in when you first started the Docker container.
mkdir ref && cd ref
datasets download genome accession GCF_009858895.2 --filename GCF_009858895.2.zip --include-gbff --include-gtf
unzip GCF_009858895.2.zip
ln -s ncbi_dataset/data/GCF_009858895.2/GCF_009858895.2_ASM985889v3_genomic.fna ref.fa
bwa index ref.fa
cd ..
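As an optional sanity check, you can index the reference with SAMtools and confirm the sequence name and length (the SARS-CoV-2 reference genome is 29,903 bases):

# the .fai index lists each sequence name and its length
samtools faidx ref/ref.fa
cut -f1,2 ref/ref.fa.fai
# NC_045512.2    29903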
Now to download the data, which took me around an hour on my (relatively fast) home Internet.
project=PRJNA784038
if [[ ! -e runinfo.csv ]]; then
  >&2 echo Generating runinfo.csv
  esearch -db sra -query ${project} | efetch -format runinfo > runinfo.csv
fi

time for acc in $(cat runinfo.csv | grep "ILLUMINA" | cut -f1 -d','); do
  >&2 echo Downloading ${acc}
  fasterq-dump -p --outdir fastq ${acc}
done
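Once the downloads are done, each paired-end Illumina run should have two FASTQ files (${acc}_1.fastq and ${acc}_2.fastq) in the fastq directory; a quick check:

# count the downloaded FASTQ files
ls fastq/*.fastq | wc -l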
Align
Now to align the raw data. Once again, you should be in your working directory. Change thread to a higher number if you have more available threads.
mkdir bam && cd bam
ref=../ref/ref.fa
thread=8

for acc in $(cat ../runinfo.csv | grep "ILLUMINA" | cut -f1 -d','); do
  platform=$(cat ../runinfo.csv | grep ${acc} | cut -f19 -d',')
  sample=$(cat ../runinfo.csv | grep ${acc} | cut -f25 -d',')
  if [[ -e ../fastq/${acc}_1.fastq && -e ../fastq/${acc}_2.fastq ]]; then
    echo Aligning ${acc}
    bwa mem \
      -t ${thread} \
      -R "@RG\tID:${acc}\tSM:${sample}\tPL:${platform}" \
      ${ref} \
      ../fastq/${acc}_1.fastq \
      ../fastq/${acc}_2.fastq |\
    samtools sort -@ ${thread} -O BAM |\
    tee ${acc}.bam |\
    samtools index - ${acc}.bam.bai
  fi
done
cd ..
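To get a feel for how well the reads aligned, you can run samtools flagstat on one of the BAM files; for example (using one of the runs from this project):

# summarise alignment statistics for a single sample
samtools flagstat bam/SRR17051948.bam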
Call variants
We will use BCFtools for calling variants.
mkdir vcf && cd vcf
ref=../ref/ref.fa

for bam in $(ls ../bam/*.bam); do
  echo ${bam}
  base=$(basename ${bam} .bam)
  bcftools mpileup -d 10000 -O v -f ${ref} ${bam} | bcftools call -mv -O v -o ${base}.vcf
  bgzip ${base}.vcf
  tabix -p vcf ${base}.vcf.gz
done

bcftools merge -o PRJNA784038_illumina.vcf -O v SRR*.vcf.gz
vt decompose PRJNA784038_illumina.vcf -o PRJNA784038_illumina.vt.vcf
java -Xmx8g -jar ../src/snpEff/snpEff.jar NC_045512.2 PRJNA784038_illumina.vt.vcf > PRJNA784038_illumina.vt.ann.vcf
bgzip PRJNA784038_illumina.vt.ann.vcf
tabix -p vcf PRJNA784038_illumina.vt.ann.vcf.gz
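As a quick check, you can count the number of variant records in the final annotated VCF file:

# -H suppresses the header, leaving one line per variant record
bcftools view -H PRJNA784038_illumina.vt.ann.vcf.gz | wc -l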
Spike protein mutations
As you have probably heard by now, the Omicron variant has over 30 mutations in the spike protein. The spike protein spans genomic coordinates 21,563 to 25,384, so we can subset our VCF file to variants that fall within the spike protein. In addition, we will use BCFtools to report only variants with an allele number (AN) greater than 30, i.e. variants called across a good number of the samples. Some genomic positions have more than one alternate allele, but since we used vt, these are decomposed onto multiple lines. To count the number of sites with one or more mutations, we can sort for unique positions.
bcftools filter -i 'AN>30 & POS > 21562 & POS < 25385' PRJNA784038_illumina.vt.ann.vcf.gz | grep -v "^#" | wc -l
35

bcftools filter -i 'AN>30 & POS > 21562 & POS < 25385' PRJNA784038_illumina.vt.ann.vcf.gz | grep -v "^#" | cut -f1,2 | sort -u | wc -l
31
To list the variants in HGVS nomenclature, we can use the extractFields subtool from SnpSift.
java -Xmx8g -jar ../src/snpEff/SnpSift.jar extractFields PRJNA784038_illumina.vt.ann.vcf.gz CHROM POS AN "ANN[*].HGVS_P" | cut -f1-4 | awk '$2>21562 && $2<25385 && $3 > 30 {print}'
NC_045512.2  21762  132  p.Ala67Val
NC_045512.2  21768  70   p.His69Leu
NC_045512.2  21770  44   p.Val70Ile
NC_045512.2  21770  44   p.Val70Leu
NC_045512.2  21770  44   p.Val70Phe
NC_045512.2  21846  134  p.Thr95Ile
NC_045512.2  21995  36   p.Tyr145Asp
NC_045512.2  21995  36   p.Tyr145Asn
NC_045512.2  22194  122  p.Asn211Ile
NC_045512.2  22578  130  p.Gly339Asp
NC_045512.2  22673  126  p.Ser371Pro
NC_045512.2  22674  126  p.Ser371Phe
NC_045512.2  22679  126  p.Ser373Pro
NC_045512.2  22686  126  p.Ser375Phe
NC_045512.2  22992  128  p.Ser477Asn
NC_045512.2  22995  132  p.Thr478Lys
NC_045512.2  23013  130  p.Glu484Ala
NC_045512.2  23040  130  p.Gln493Arg
NC_045512.2  23048  128  p.Gly496Ser
NC_045512.2  23055  130  p.Gln498Arg
NC_045512.2  23063  132  p.Asn501Tyr
NC_045512.2  23075  130  p.Tyr505His
NC_045512.2  23202  132  p.Thr547Lys
NC_045512.2  23403  136  p.Asp614Gly
NC_045512.2  23525  130  p.His655Tyr
NC_045512.2  23599  132  p.Asn679Lys
NC_045512.2  23599  132  p.Asn679Lys
NC_045512.2  23604  130  p.Pro681His
NC_045512.2  23854  78   p.Asn764Lys
NC_045512.2  23948  130  p.Asp796Tyr
NC_045512.2  24130  132  p.Asn856Lys
NC_045512.2  24424  134  p.Gln954His
NC_045512.2  24469  136  p.Asn969Lys
NC_045512.2  24503  136  p.Leu981Phe
NC_045512.2  25000  134  p.Asp1146Asp
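If you also want the gene name alongside each protein change, extractFields can pull additional fields from the SnpEff ANN annotation (a sketch; ANN[*].GENE is a standard SnpEff annotation field):

# include the gene name as an extra column
java -Xmx8g -jar ../src/snpEff/SnpSift.jar extractFields PRJNA784038_illumina.vt.ann.vcf.gz CHROM POS AN "ANN[*].GENE" "ANN[*].HGVS_P" | head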
Many of the variants above are already reported on CoVariants. If you use this data, please cite Rapid epidemic expansion of the SARS-CoV-2 Omicron variant in southern Africa.
For those who do not want to run the entire workflow but want to analyse the variant calls, I have hosted the final VCF file (and its index) with the Omicron variants on my web server.
Finally, I have some other analyses and notes on SARS-CoV-2 on my GitHub.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Comments

Hello Dave,
Great post, as usual. I have a couple of questions:
1. Every time I exit the Docker container, will I lose all the programs I installed – e.g. BWA, SnpEff, etc.? Is there a way to avoid having to install all of these programs every time?
2. The commands to count the number of sites with one or more mutations is not working (bcftools filter -i ‘AN>30 & POS > 21562 & POS < 25385' PRJNA784038_illumina.vt.ann.vcf.gz | grep -v "^#" | wc -l)
I am getting, "the file ""25385' PRJNA784038_illumina.vt.ann.vcf.gz"" not recognizable ???
Thx a lot!
Hi Akhil,
1. If you start Docker as per my command at the start of the post, all the tools, data, and results will be in the directory where you ran the Docker command.
2. It looks like the quote before AN and the quote after 25385 are unmatched. I’m not sure how that happened because if I copy and paste the command from my blog post I get matched single quotes.
Hope that helps,
Dave
Great, sounds good. Let me test it out and I will update you soon with my findings/success. Thanks once again and can’t wait to read your next post 🙂
Using the docker command I installed all the tools in the “pwd”.
I exited out of the terminal.
Using your docker command (docker run --rm -it -v $(pwd):$(pwd) -w $(pwd) davetang/build:1.1 /bin/bash) I again entered the environment.
I do see the “src” and other directories.
But when I try to use a command/tool, say “fasterq-dump”, to download the data, I get the following error, as if the system is not able to recognize the tool:
Downloading SRR17051948
bash: fasterq-dump: command not found
Downloading SRR17051947
bash: fasterq-dump: command not found
Downloading SRR17051946
bash: fasterq-dump: command not found
Downloading SRR17051944
bash: fasterq-dump: command not found
Downloading SRR17051943
bash: fasterq-dump: command not found
What am I doing wrong?
Could it be that I am not able to get into the correct container???
Thanks
Hi Akhil,
you’re doing nothing wrong; I wrote the post without intending for the container to be re-run. You are getting the error because the PATH no longer contains the directories where the tools are installed, and therefore the shell cannot find them.
What you need to do is follow all the “cd” commands and re-run the export PATH command. However, since the post installs BWA, vt, SAMtools, BCFtools, and HTSlib into /usr/local/bin (inside the container), you will need to re-run those installations for now.
I’ll modify the post later so the tools will be installed into a common directory on the mounted volume instead, so you don’t need to reinstall them. Then you just need to run the export command once.
Cheers,
Dave
Fantastic, thx a lot, that makes sense.
On a separate note, I was able to count the number of sites with one or more mutations and was able to use extractFields. I am so excited!
Will be waiting eagerly to get the modified code so that one doesn’t have to re-install the tools, every time 🙂
Also will be waiting to read more blog-posts from you regarding efficient and reproducible bioinformatics workflows 🙂
Thanks and take care!
Happy to hear that you could reproduce the analysis! I have updated the post so that the tools can be found in a new container instance. You will have to re-run some of the commands so that the tools will be copied to the tools directory. Thanks and take care as well!
Thanks a lot Dave. I re-installed all the tools and data, and performed the analysis. Then I re-entered the environment and was able to re-use the previously installed tools. Yaaay.
Once again thx a lot, it is just amazing.
Hi Dave,
Thanks for sharing the variant calling pipeline. Multiallelic calls can be split using bcftools as well, via the `norm` command.
Ref: https://samtools.github.io/bcftools/bcftools.html#norm
Hi there,
thanks for the comment! I recently found out about `bcftools norm` and have added it to my notes:
https://github.com/davetang/learning_vcf_file#decomposing-and-normalising-variants
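For anyone reading along, splitting multiallelic sites with bcftools looks something like this (a minimal sketch using the -m- option, which splits multiallelic records into biallelic ones; file names are from this post):

# split multiallelic sites into biallelic records (an alternative to vt decompose)
bcftools norm -m- PRJNA784038_illumina.vcf -o PRJNA784038_illumina.norm.vcf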
Cheers,
Dave