Omicron variants

In this post, I describe a simple workflow for identifying Omicron variants from some sequencing data shared by the Kwazulu-Natal Research Innovation and Sequencing Platform (KRISP) and the Centre for Epidemic Response and Innovation (CERI). To follow the workflow, you need to have Docker installed and an Internet connection. To get started, run a container using davetang/build:1.1, which is an image I prepared for carrying out bioinformatic analysis. If you need some background on Docker, I have some notes on GitHub.

docker run --rm -it -v $(pwd):$(pwd) -w $(pwd) davetang/build:1.1 /bin/bash

# run these commands once you are inside the container
mkdir -p tools/bin
export PATH=$(pwd)/tools/bin:$(pwd)/src/edirect:$(pwd)/src/sratoolkit.2.11.3-ubuntu64/bin/:$PATH

# run this ONLY if you have already installed SRA Toolkit
# push x when you see the config screen
vdb-config --interactive

The docker run command mounts your current directory into the Docker container and starts the container in that same directory. All of the commands in this post save data to this directory, so your data will still persist on your local computer after you exit the container. In addition, we will install all tools into the tools directory, so make sure you also run the export PATH command above once you are inside the Docker container.

If you have exited the container and want to resume your analysis in a new container, make sure you run the commands above from the same directory where you first ran the docker run command.
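Concretely, resuming a session looks like this (a sketch; it assumes the tools and src directory layout that the rest of this post creates):

```shell
# from the directory where you originally ran docker run
docker run --rm -it -v $(pwd):$(pwd) -w $(pwd) davetang/build:1.1 /bin/bash

# inside the new container, re-export the PATH so the shell can find the tools
export PATH=$(pwd)/tools/bin:$(pwd)/src/edirect:$(pwd)/src/sratoolkit.2.11.3-ubuntu64/bin/:$PATH

# sanity check that the previously installed tools are found again
which bwa samtools bcftools fasterq-dump
```

If the which command reports nothing, you are either in the wrong directory or have not yet installed the tools.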

Tools

We need various tools for the workflow; you should be able to just copy and paste the commands below to install them from inside the Docker container. We will download each tool into the src directory.

Set up Entrez Direct.

mkdir src && cd src
url=ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect

wget ${url}/edirect.tar.gz
tar -xzf edirect.tar.gz

cd edirect
plt=Linux
for exc in xtract rchive transmute; do
   wget ${url}/${exc}.${plt}.gz
   gunzip ${exc}.${plt}.gz
   chmod 755 ${exc}.${plt}
done
cd ..

Set up the SRA Toolkit and NCBI's datasets tool. For the SRA Toolkit, we need to run vdb-config before first use; when the setup screen appears, just press the letter x on your keyboard to exit.

wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.11.3/sratoolkit.2.11.3-ubuntu64.tar.gz
tar -xzf sratoolkit.2.11.3-ubuntu64.tar.gz

wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
chmod 755 datasets && mv datasets ../tools/bin

# press x
vdb-config --interactive

Next up, we need SnpEff for annotating our variants. After downloading the tool, we will download the reference sequence for SARS-CoV-2 (NC_045512.2).

wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip

cd snpEff
java -jar snpEff.jar download NC_045512.2
cd ..

We will use vt to decompose our variants, i.e. display multiallelic variants on multiple lines instead of one.

git clone https://github.com/atks/vt.git
cd vt
git submodule update --init --recursive
make
make test
mv vt ../../tools/bin
cd ..
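To make decomposition concrete, here is a toy illustration on a made-up multiallelic record, using awk rather than vt itself; it splits the comma-separated ALT column into one line per alternate allele (real vt also adjusts the genotype and INFO fields, which this sketch ignores):

```shell
# a hypothetical multiallelic record: two ALT alleles (A and T) on one line
printf 'NC_045512.2\t21995\t.\tC\tA,T\t100\tPASS\t.\n' |
  awk -F'\t' -v OFS='\t' '{
    n = split($5, alt, ",")       # split the comma-separated ALT column
    for (i = 1; i <= n; i++) {    # emit one biallelic line per ALT allele
      $5 = alt[i]
      print
    }
  }'
```

Running this prints two biallelic records, one with ALT A and one with ALT T, which is the representation the downstream counting commands rely on.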

We will use BWA for aligning the reads.

wget https://github.com/lh3/bwa/archive/refs/tags/v0.7.17.tar.gz
tar -xzf v0.7.17.tar.gz
cd bwa-0.7.17 && make && mv bwa ../../tools/bin
cd ..

Finally, we need BCFtools, Samtools, and HTSlib, which are staples of bioinformatic high-throughput sequence analysis.

ver=1.14
dir=$(pwd)/../tools

for tool in htslib bcftools samtools; do
   url=https://github.com/samtools/${tool}/releases/download/${ver}/${tool}-${ver}.tar.bz2
   wget ${url}
   tar xjf ${tool}-${ver}.tar.bz2
   cd ${tool}-${ver}
   ./configure --prefix=${dir}
   make && make install
   cd ..
done

Download data

We need a reference sequence to align our reads to and call variants against. GCF_009858895.2 is a RefSeq assembly, which means it is maintained by NCBI; this assembly is based on NC_045512.2.

Before running the following commands, move back to the working directory (i.e. navigate one directory above src), if you haven't already; this is the directory you were initially in when you first started the Docker container.

mkdir ref && cd ref
datasets download genome accession GCF_009858895.2 --filename GCF_009858895.2.zip --include-gbff --include-gtf
unzip GCF_009858895.2.zip
ln -s ncbi_dataset/data/GCF_009858895.2/GCF_009858895.2_ASM985889v3_genomic.fna ref.fa
bwa index ref.fa
cd ..

Now to download the data, which took me around an hour on my (relatively fast) home Internet.

project=PRJNA784038

if [[ ! -e runinfo.csv ]]; then
   >&2 echo Generating runinfo.csv
   esearch -db sra -query ${project} | efetch -format runinfo > runinfo.csv
fi

time for acc in $(cat runinfo.csv | grep "ILLUMINA" | cut -f1 -d','); do
   >&2 echo Downloading ${acc};
   fasterq-dump -p --outdir fastq ${acc}
done

Align

Now to align the raw data. Once again, you should be in your working directory. Change thread to a higher number if you have more available threads.

mkdir bam && cd bam
ref=../ref/ref.fa
thread=8

for acc in $(cat ../runinfo.csv | grep "ILLUMINA" | cut -f1 -d','); do

   platform=$(cat ../runinfo.csv | grep ${acc} | cut -f19 -d',')
   sample=$(cat ../runinfo.csv | grep ${acc} | cut -f25 -d',')

   if [[ -e ../fastq/${acc}_1.fastq && -e ../fastq/${acc}_2.fastq ]]; then
      echo Aligning ${acc}
      bwa mem \
        -t ${thread} \
        -R "@RG\tID:${acc}\tSM:${sample}\tPL:${platform}" \
        ${ref} \
        ../fastq/${acc}_1.fastq \
        ../fastq/${acc}_2.fastq |
        samtools sort -@ ${thread} -O BAM |
        tee ${acc}.bam |
        samtools index - ${acc}.bam.bai
   fi

done
cd ..

Call variants

We will use BCFtools for calling variants.

mkdir vcf && cd vcf
ref=../ref/ref.fa

for bam in ../bam/*.bam; do

   echo ${bam}
   base=$(basename ${bam} .bam)
   bcftools mpileup -d 10000 -O v -f ${ref} ${bam} | bcftools call -mv -O v -o ${base}.vcf
   bgzip ${base}.vcf
   tabix -p vcf ${base}.vcf.gz

done

bcftools merge -o PRJNA784038_illumina.vcf -O v SRR*.vcf.gz
vt decompose PRJNA784038_illumina.vcf -o PRJNA784038_illumina.vt.vcf
java -Xmx8g -jar ../src/snpEff/snpEff.jar NC_045512.2 PRJNA784038_illumina.vt.vcf > PRJNA784038_illumina.vt.ann.vcf
bgzip PRJNA784038_illumina.vt.ann.vcf
tabix -p vcf PRJNA784038_illumina.vt.ann.vcf.gz

Spike protein mutations

As you have probably heard by now, the Omicron variant has over 30 mutations in the spike protein. The spike protein spans genomic coordinates 21563 to 25384, so we can subset our VCF file to variants that fall within the spike protein. In addition, we will use BCFtools to only report variants where the total number of called alleles (AN) across the samples exceeds 30. Some genomic positions have more than one alternate allele, but since we used vt, these are decomposed onto multiple lines; to count the number of sites with one or more mutations, we can sort for unique positions.

bcftools filter -i 'AN>30 & POS > 21562 & POS < 25385' PRJNA784038_illumina.vt.ann.vcf.gz | grep -v "^#" | wc -l
35

bcftools filter -i 'AN>30 & POS > 21562 & POS < 25385' PRJNA784038_illumina.vt.ann.vcf.gz | grep -v "^#" | cut -f1,2 | sort -u | wc -l
31

To list the variants in HGVS nomenclature, we can use the extractFields subtool from SnpSift.

java -Xmx8g -jar ../src/snpEff/SnpSift.jar extractFields PRJNA784038_illumina.vt.ann.vcf.gz CHROM POS AN "ANN[*].HGVS_P" | cut -f1-4 | awk '$2>21562 && $2<25385 && $3 > 30 {print}'
NC_045512.2     21762   132     p.Ala67Val
NC_045512.2     21768   70      p.His69Leu
NC_045512.2     21770   44      p.Val70Ile
NC_045512.2     21770   44      p.Val70Leu
NC_045512.2     21770   44      p.Val70Phe
NC_045512.2     21846   134     p.Thr95Ile
NC_045512.2     21995   36      p.Tyr145Asp
NC_045512.2     21995   36      p.Tyr145Asn
NC_045512.2     22194   122     p.Asn211Ile
NC_045512.2     22578   130     p.Gly339Asp
NC_045512.2     22673   126     p.Ser371Pro
NC_045512.2     22674   126     p.Ser371Phe
NC_045512.2     22679   126     p.Ser373Pro
NC_045512.2     22686   126     p.Ser375Phe
NC_045512.2     22992   128     p.Ser477Asn
NC_045512.2     22995   132     p.Thr478Lys
NC_045512.2     23013   130     p.Glu484Ala
NC_045512.2     23040   130     p.Gln493Arg
NC_045512.2     23048   128     p.Gly496Ser
NC_045512.2     23055   130     p.Gln498Arg
NC_045512.2     23063   132     p.Asn501Tyr
NC_045512.2     23075   130     p.Tyr505His
NC_045512.2     23202   132     p.Thr547Lys
NC_045512.2     23403   136     p.Asp614Gly
NC_045512.2     23525   130     p.His655Tyr
NC_045512.2     23599   132     p.Asn679Lys
NC_045512.2     23599   132     p.Asn679Lys
NC_045512.2     23604   130     p.Pro681His
NC_045512.2     23854   78      p.Asn764Lys
NC_045512.2     23948   130     p.Asp796Tyr
NC_045512.2     24130   132     p.Asn856Lys
NC_045512.2     24424   134     p.Gln954His
NC_045512.2     24469   136     p.Asn969Lys
NC_045512.2     24503   136     p.Leu981Phe
NC_045512.2     25000   134     p.Asp1146Asp

Many of the variants above are already reported on CoVariants. If you use this data, please cite Rapid epidemic expansion of the SARS-CoV-2 Omicron variant in southern Africa.

For those that do not want to run the entire workflow but want to analyse the variant calls, I have hosted the final VCF file (and the index file) with the Omicron variants on my web server.

Finally, I have some other analyses and notes on SARS-CoV-2 on my GitHub.




This work is licensed under a Creative Commons Attribution 4.0 International License.
10 comments
  1. Hello Dave,

    Great post, as usual. I have a couple of questions:

    1. Every time I exit out from the docker, will I lose all the programs I installed – eg BWA, SNPEff, etc? Is there a way to not install all of these programs every time?

    2. The commands to count the number of sites with one or more mutations is not working (bcftools filter -i ‘AN>30 & POS > 21562 & POS < 25385' PRJNA784038_illumina.vt.ann.vcf.gz | grep -v "^#" | wc -l)

    I am getting, "the file ""25385' PRJNA784038_illumina.vt.ann.vcf.gz"" not recognizable ???

    Thx a lot!

    1. Hi Akhil,

      1. If you start Docker as per my command at the start of the post, all the tools, data, and results will be in the directory where you ran the Docker command.
      2. It looks like the quote before AN and the quote after 25385 are unmatched. I’m not sure how that happened because if I copy and paste the command from my blog post I get matched single quotes.

      Hope that helps,
      Dave

      1. Great, sounds good. Let me test it out and I will update you soon with my findings/success. Thanks once again and can’t wait to read your next post 🙂

        1. Using the docker command I installed all the tools in the “pwd”.

          I exited out of the terminal.

          Using your docker command (docker run --rm -it -v $(pwd):$(pwd) -w $(pwd) davetang/build:1.1 /bin/bash) I again entered inside the environment.

          I do see the “src” and other directories.

          But when I try to use the command/tool, say “fasterq”, to download the data, I get the following error, as if the system is not able to recognize the tool:

          Downloading SRR17051948
          bash: fasterq-dump: command not found
          Downloading SRR17051947
          bash: fasterq-dump: command not found
          Downloading SRR17051946
          bash: fasterq-dump: command not found
          Downloading SRR17051944
          bash: fasterq-dump: command not found
          Downloading SRR17051943
          bash: fasterq-dump: command not found

          What am I doing wrong?

          Could it be that I am not able get into the correct container???

          Thanks

          1. Hi Akhil,

            you’re doing nothing wrong; I wrote the post without the intention of re-running the container. You are getting an error because the PATH no longer contains the directories of where the tools are installed and therefore the shell cannot find the tools.

            What you need to do is to follow all the “cd” commands and re-enter the export PATH commands. However, since the post installs BWA, vt, SAMtools, BCFtools, and HTSlib into /usr/local/bin (inside the container), you will need to re-run those installations, for now.

            I’ll modify the post later so the tools will be installed into a common directory on the mounted volume instead, so you don’t need to reinstall them. Then you just need to run the export command once.

            Cheers,
            Dave

            1. Fantastic, thx a lot, that makes sense.

              On a separate note, I was able to count the number of sites with one or more mutations and was able to use extractFields. I am so excited!

              Will be waiting eagerly to get the modified code so that one doesn’t have to re-install the tools, every time 🙂

              Also will be waiting to read more blog-posts from you regarding efficient and reproducible bioinformatics workflows 🙂

              Thanks and take care!

              1. Happy to hear that you could reproduce the analysis! I have updated the post so that the tools can be found in a new container instance. You will have to re-run some of the commands so that the tools will be copied to the tools directory. Thanks and take care as well!

                1. Thanks a lot Dave, I re-installed all the tools, data and performed analysis. Then I re-entered the environment and was able to re-use the previously installed tools. Yaaay.

                  Once again thx a lot, it is just amazing.
