Comparing VCF files

In this post, I will compare different tools for comparing VCF files. To create a reproducible example, I will make use of Docker and Conda. I highly recommend learning about these tools if you haven’t already; they make it easier to reproduce your work. I have written some notes on Docker and Conda that maybe useful.

The example VCF file and other scripts used for this post are available in my GitHub repository, so please clone that somewhere if you want to follow this post. The repo also has a bunch of notes on VCF files in general.

git clone https://github.com/davetang/learning_vcf_file.git

If you don’t have Docker, install it first. Next, pull the Conda image and use Conda to install the various tools that will be used in this post.

docker pull continuumio/miniconda
docker run --rm -v /Users/dtang/github/learning_vcf_file:/data -i -t continuumio/miniconda /bin/bash

# run the following commands inside Docker
conda install -y -c bioconda vcftools
conda install -y -c bioconda bcftools
conda install -y -c bioconda bedtools
conda install -y -c bioconda perl-vcftools-vcf
conda install -y -c bioconda tabix
conda install -y -c bioconda snpsift

Alternatively, you can prepare a YAML file that contains the required tools and create a Conda environment. Below are the contents of variant.yml, which is also in my GitHub repository.

name: variant
channels:
  - bioconda
  - anaconda
  - conda-forge
  - defaults
dependencies:
  - vcftools
  - bcftools
  - bedtools
  - perl-vcftools-vcf
  - tabix
  - snpsift

Using conda env create will build a Conda environment with all the required tools for this post.

# you should see base
conda env list

# create environment
conda env create -f variant.yml

# activate environment
conda activate variant

VCF file

I will use NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz as an example VCF file; this file can be downloaded from the GATK bundle but I have also provided it in my GitHub repository. I used BCFtools with some other command line tools to get a feel of the VCF file. (I chose this file because it is relatively small: 4.1M.)

# total number of SNPs
bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz | grep -v "^#" | wc -l
281346

# total number of unique positions, indicating that several sites have two or more alternate alleles
bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz | grep -v "^#" | cut -f2 | sort -u | wc -l
280764

# distribution of ref vs. alt alleles
# notice the single dinucleotide change TG -> CG and an unnormalised variant AT -> AC
bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz | grep -v "^#" | cut -f4,5 | sort | uniq -c | sort -k1rn
  60826 G       A
  59893 C       T
  37360 T       C
  37317 A       G
  13395 C       G
  13356 G       C
  12737 C       A
  12552 G       T
  10038 T       G
   9957 A       C
   6957 T       A
   6956 A       T
      1 AT      AC
      1 TG      CG

Comparing VCF files

First, I will prepare a compressed VCF file with only SNPs (removing the dinucleotide and unnormalised variant) and compare it to itself.

bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz |
perl -lane 'if (/^#/) { print } elsif (length($F[3]) == 1) { print }' |
bgzip > snp.vcf.gz

# index
tabix -p vcf snp.vcf.gz

Compare VCF files using BEDTools.

# I use the -u parameter, which reports the variant in -a if an overlap was found
bedtools intersect -u -a snp.vcf.gz -b snp.vcf.gz | wc -l
281344

# calculate Jaccard index
bedtools jaccard -a snp.vcf.gz -b snp.vcf.gz
intersection    union   jaccard n_intersections
281003  281003  1       277984

I’m not sure how n_intersections ended up as 277,984.

The next tool to test is vcf-compare, which also reports the number of duplicate sites. We can see the same results as bedtools jaccard, which is that 281,003 variants overlap each other.

vcf-compare snp.vcf.gz snp.vcf.gz
# This file was generated by vcf-compare.
# The command line was: vcf-compare(v0.1.14-12-gcdb80b8) snp.vcf.gz snp.vcf.gz
#
#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
#VN The columns are:
#VN        1  .. number of sites unique to this particular combination of files
#VN        2- .. combination of files and space-separated number, a fraction of sites in the file
VN      281003  snp.vcf.gz (100.0%)
#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN      Number of REF matches:  0
SN      Number of ALT matches:  0
SN      Number of REF mismatches:       0
SN      Number of ALT mismatches:       0
SN      Number of samples in GT comparison:     0
# Number of sites lost due to grouping (e.g. duplicate sites): lost, %lost, read, reported, file
SN      Number of lost sites:   341     0.1%    281344  281003  snp.vcf.gz
SN      Number of lost sites:   341     0.1%    281344  281003  snp.vcf.gz

The tool bcftools performs the VCF comparison and generates results (four separate VCF files) in a specified directory.

bcftools isec snp.vcf.gz snp.vcf.gz -p isec

cat isec/README.txt 
This file was produced by vcfisec.
The command line was:   bcftools isec  -p isec snp.vcf.gz snp.vcf.gz

Using the following file names:
isec/0000.vcf   for records private to  snp.vcf.gz
isec/0001.vcf   for records private to  snp.vcf.gz
isec/0002.vcf   for records from snp.vcf.gz shared by both      snp.vcf.gz snp.vcf.gz
isec/0003.vcf   for records from snp.vcf.gz shared by both      snp.vcf.gz snp.vcf.gz

# no private variants
cat isec/0000.vcf | grep -v "^#" | wc -l
0

# no private variants
cat isec/0001.vcf | grep -v "^#" | wc -l
0

# shared variants
cat isec/0002.vcf | grep -v "^#" | wc -l
281344

# shared variants
cat isec/0003.vcf | grep -v "^#" | wc -l
281344

Finally I’ll use SnpSift, which I have written a separate blog post on. Unfortunately this tool cannot use a bgzip file, so I had to gunzip the file.

gunzip snp.vcf.gz
SnpSift concordance -v snp.vcf snp.vcf > snp_concordance.txt

# output snipped
00:01:27        Writing concordance by sample to file 'concordance_snp_snp.by_sample.txt'
00:01:27        Writing summary file 'concordance_snp_snp.summary.txt'

cat concordance_snp_snp.by_sample.txt | ../script/transpose.pl 
sample  NA12878



MISSING_ENTRY_snp/MISSING_ENTRY_snp     0
MISSING_ENTRY_snp/MISSING_GT_snp        0
MISSING_ENTRY_snp/REF   0
MISSING_ENTRY_snp/ALT_1 0
MISSING_ENTRY_snp/ALT_2 0
MISSING_GT_snp/MISSING_ENTRY_snp        0
MISSING_GT_snp/MISSING_GT_snp   136736
MISSING_GT_snp/REF      0
MISSING_GT_snp/ALT_1    0
MISSING_GT_snp/ALT_2    0
REF/MISSING_ENTRY_snp   0
REF/MISSING_GT_snp      0
REF/REF 13
REF/ALT_1       0
REF/ALT_2       0
ALT_1/MISSING_ENTRY_snp 0
ALT_1/MISSING_GT_snp    0
ALT_1/REF       0
ALT_1/ALT_1     90140
ALT_1/ALT_2     0
ALT_2/MISSING_ENTRY_snp 0
ALT_2/MISSING_GT_snp    0
ALT_2/REF       0
ALT_2/ALT_1     0
ALT_2/ALT_2     54455
ERROR   0

The summary by SnpSift also shows the type of agreement and/or disagreement (which we don’t see because we’re comparing the same file).

Now to compare two VCF files with different subsets of variants. The Perl one-liner creates (reproducible) variant subsets by generating a random number (using a seed) for each variant and only printing out the variant if the generated random number is greater than 0.5. Our two VCF files will have half the variants of the original file and will likely overlap 50% of the time.

# first VCF file
bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz |
perl -lane 'BEGIN {srand(31)} if (/^#/) { print } elsif (length($F[3]) == 1) { if (rand(1) > 0.5) {print} }' |
bgzip > first.vcf.gz
tabix -p vcf first.vcf.gz

# second VCF file
bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz |
perl -lane 'BEGIN {srand(1984)} if (/^#/) { print } elsif (length($F[3]) == 1) { if (rand(1) > 0.5) {print} }' |
bgzip > second.vcf.gz
tabix -p vcf second.vcf.gz

# number of variants
zcat first.vcf.gz| grep -v "^#" | wc -l
140879

zcat second.vcf.gz | grep -v "^#" | wc -l
140367

Perform the VCF comparisons using the various tools.

# intersect reports results with respect to -a
bedtools intersect -u -a first.vcf.gz -b second.vcf.gz | wc -l
70446

# results differ when the second file is used as -a
bedtools intersect -u -a second.vcf.gz -b first.vcf.gz | wc -l
70454

bedtools jaccard -a first.vcf.gz -b second.vcf.gz
intersection    union   jaccard n_intersections
70367   210677  0.334004        70156

vcf-compare first.vcf.gz second.vcf.gz
# This file was generated by vcf-compare.
# The command line was: vcf-compare(v0.1.14-12-gcdb80b8) first.vcf.gz second.vcf.gz
#
#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
#VN The columns are: 
#VN        1  .. number of sites unique to this particular combination of files
#VN        2- .. combination of files and space-separated number, a fraction of sites in the file
VN      69895   second.vcf.gz (49.8%)
VN      70367   first.vcf.gz (50.0%)    second.vcf.gz (50.2%)
VN      70415   first.vcf.gz (50.0%)
#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN      Number of REF matches:  70367
SN      Number of ALT matches:  70209
SN      Number of REF mismatches:       0
SN      Number of ALT mismatches:       158
SN      Number of samples in GT comparison:     0
# Number of sites lost due to grouping (e.g. duplicate sites): lost, %lost, read, reported, file
SN      Number of lost sites:   97      0.1%    140879  140782  first.vcf.gz
SN      Number of lost sites:   105     0.1%    140367  140262  second.vcf.gz

bcftools isec first.vcf.gz second.vcf.gz -p first_second

# records private to  first.vcf.gz
cat first_second/0000.vcf | grep -v "^#" | wc -l
70529

# records private to  second.vcf.gz
cat first_second/0001.vcf | grep -v "^#" | wc -l
70017

# records from first.vcf.gz shared by both    first.vcf.gz second.vcf.gz
cat first_second/0002.vcf | grep -v "^#" | wc -l
70350

# records from second.vcf.gz shared by both   first.vcf.gz second.vcf.gz
cat first_second/0003.vcf | grep -v "^#" | wc -l
70350

# unzip for SnpSift
gunzip first.vcf.gz
gunzip second.vcf.gz
SnpSift concordance -v first.vcf second.vcf > snp_concordance.txt

# output snipped
00:01:20        Writing concordance by sample to file 'concordance_first_second.by_sample.txt'
00:01:20        Writing summary file 'concordance_first_second.summary.txt'

# Errors:
        ALT field does not match        93

cat concordance_first_second.by_sample.txt | ../script/transpose.pl 
sample  NA12878



MISSING_ENTRY_first/MISSING_ENTRY_second        0
MISSING_ENTRY_first/MISSING_GT_second   33662
MISSING_ENTRY_first/REF 7
MISSING_ENTRY_first/ALT_1       22635
MISSING_ENTRY_first/ALT_2       13662
MISSING_GT_first/MISSING_ENTRY_second   34499
MISSING_GT_first/MISSING_GT_second      34248
MISSING_GT_first/REF    0
MISSING_GT_first/ALT_1  0
MISSING_GT_first/ALT_2  0
REF/MISSING_ENTRY_second        1
REF/MISSING_GT_second   0
REF/REF 4
REF/ALT_1       0
REF/ALT_2       0
ALT_1/MISSING_ENTRY_second      22462
ALT_1/MISSING_GT_second 0
ALT_1/REF       0
ALT_1/ALT_1     22457
ALT_1/ALT_2     0
ALT_2/MISSING_ENTRY_second      13516
ALT_2/MISSING_GT_second 0
ALT_2/REF       0
ALT_2/ALT_1     0
ALT_2/ALT_2     13599
ERROR   93

Summary

In this post I demonstrate the use of four different tools for comparing VCF files. Here’s a short summary of each tool:

* BEDTools can be used to compare VCF files but only by comparing genomic coordinates; this can provide a quick answer to how many variants overlap and can be used to calculate a Jaccard index, indicating the amount of overall overlap
* vcf-compare provides additional statistics from BEDTools including the number of duplicate sites and Venn-Diagram Numbers, which show the number of exclusive variants in each respective VCF file
* bcftools isec also provides Venn-Diagram Numbers and additionally creates VCF files based on these intersections
* SnpSift concordance provides intersection counts as well as genotype differences between two VCF files; this is particularly useful for comparing variant calls from two different tools

Of note is that the numbers are slightly different between the different tools. I suspect that this may be due to the way each tool deals with duplicated sites. Furthermore, I have a section in a blog post (see section: “The Variant Call Format”) that discusses how VCF files can be standardised so that they can actually be compared to each other.

Finally, I haven’t written a blog post in a while but I will try to set aside some time for blogging.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.