In this post, I will compare different tools for comparing VCF files. To create a reproducible example, I will make use of Docker and Conda. I highly recommend learning about these tools if you haven't already; they make it easier to reproduce your work. I have written some notes on Docker and Conda that maybe useful.
The example VCF file and other scripts used for this post are available in my GitHub repository, so please clone that somewhere if you want to follow this post. The repo also has a bunch of notes on VCF files in general.
git clone https://github.com/davetang/learning_vcf_file.git
If you don't have Docker, install it first. Next, pull the Conda image and use Conda to install the various tools that will be used in this post.
docker pull continuumio/miniconda docker run --rm -v /Users/dtang/github/learning_vcf_file:/data -i -t continuumio/miniconda /bin/bash # run the following commands inside Docker conda install -y -c bioconda vcftools conda install -y -c bioconda bcftools conda install -y -c bioconda bedtools conda install -y -c bioconda perl-vcftools-vcf conda install -y -c bioconda tabix conda install -y -c bioconda snpsift
Alternatively, you can prepare a YAML file that contains the required tools and create a Conda environment. Below are the contents of variant.yml, which is also in my GitHub repository.
name: variant channels: - bioconda - anaconda - conda-forge - defaults dependencies: - vcftools - bcftools - bedtools - perl-vcftools-vcf - tabix - snpsift
Using conda env create will build a Conda environment with all the required tools for this post.
# you should see base conda env list # create environment conda env create -f variant.yml # activate environment conda activate variant
VCF file
I will use NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz as an example VCF file; this file can be downloaded from the GATK bundle but I have also provided it in my GitHub repository. I used BCFtools with some other command line tools to get a feel of the VCF file. (I chose this file because it is relatively small: 4.1M.)
# total number of SNPs bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz | grep -v "^#" | wc -l 281346 # total number of unique positions, indicating that several sites have two or more alternate alleles bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz | grep -v "^#" | cut -f2 | sort -u | wc -l 280764 # distribution of ref vs. alt alleles # notice the single dinucleotide change TG -> CG and an unnormalised variant AT -> AC bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz | grep -v "^#" | cut -f4,5 | sort | uniq -c | sort -k1rn 60826 G A 59893 C T 37360 T C 37317 A G 13395 C G 13356 G C 12737 C A 12552 G T 10038 T G 9957 A C 6957 T A 6956 A T 1 AT AC 1 TG CG
Comparing VCF files
First, I will prepare a compressed VCF file with only SNPs (removing the dinucleotide and unnormalised variant) and compare it to itself.
bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz | perl -lane 'if (/^#/) { print } elsif (length($F[3]) == 1) { print }' | bgzip > snp.vcf.gz # index tabix -p vcf snp.vcf.gz
Compare VCF files using BEDTools.
# I use the -u parameter, which reports the variant in -a if an overlap was found bedtools intersect -u -a snp.vcf.gz -b snp.vcf.gz | wc -l 281344 # calculate Jaccard index bedtools jaccard -a snp.vcf.gz -b snp.vcf.gz intersection union jaccard n_intersections 281003 281003 1 277984
I'm not sure how n_intersections ended up as 277,984.
The next tool to test is vcf-compare, which also reports the number of duplicate sites. We can see the same results as bedtools jaccard, which is that 281,003 variants overlap each other.
vcf-compare snp.vcf.gz snp.vcf.gz # This file was generated by vcf-compare. # The command line was: vcf-compare(v0.1.14-12-gcdb80b8) snp.vcf.gz snp.vcf.gz # #VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part. #VN The columns are: #VN 1 .. number of sites unique to this particular combination of files #VN 2- .. combination of files and space-separated number, a fraction of sites in the file VN 281003 snp.vcf.gz (100.0%) #SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part. SN Number of REF matches: 0 SN Number of ALT matches: 0 SN Number of REF mismatches: 0 SN Number of ALT mismatches: 0 SN Number of samples in GT comparison: 0 # Number of sites lost due to grouping (e.g. duplicate sites): lost, %lost, read, reported, file SN Number of lost sites: 341 0.1% 281344 281003 snp.vcf.gz SN Number of lost sites: 341 0.1% 281344 281003 snp.vcf.gz
The tool bcftools performs the VCF comparison and generates results (four separate VCF files) in a specified directory.
bcftools isec snp.vcf.gz snp.vcf.gz -p isec cat isec/README.txt This file was produced by vcfisec. The command line was: bcftools isec -p isec snp.vcf.gz snp.vcf.gz Using the following file names: isec/0000.vcf for records private to snp.vcf.gz isec/0001.vcf for records private to snp.vcf.gz isec/0002.vcf for records from snp.vcf.gz shared by both snp.vcf.gz snp.vcf.gz isec/0003.vcf for records from snp.vcf.gz shared by both snp.vcf.gz snp.vcf.gz # no private variants cat isec/0000.vcf | grep -v "^#" | wc -l 0 # no private variants cat isec/0001.vcf | grep -v "^#" | wc -l 0 # shared variants cat isec/0002.vcf | grep -v "^#" | wc -l 281344 # shared variants cat isec/0003.vcf | grep -v "^#" | wc -l 281344
Finally I'll use SnpSift, which I have written a separate blog post on. Unfortunately this tool cannot use a bgzip file, so I had to gunzip the file.
gunzip snp.vcf.gz SnpSift concordance -v snp.vcf snp.vcf > snp_concordance.txt # output snipped 00:01:27 Writing concordance by sample to file 'concordance_snp_snp.by_sample.txt' 00:01:27 Writing summary file 'concordance_snp_snp.summary.txt' cat concordance_snp_snp.by_sample.txt | ../script/transpose.pl sample NA12878 MISSING_ENTRY_snp/MISSING_ENTRY_snp 0 MISSING_ENTRY_snp/MISSING_GT_snp 0 MISSING_ENTRY_snp/REF 0 MISSING_ENTRY_snp/ALT_1 0 MISSING_ENTRY_snp/ALT_2 0 MISSING_GT_snp/MISSING_ENTRY_snp 0 MISSING_GT_snp/MISSING_GT_snp 136736 MISSING_GT_snp/REF 0 MISSING_GT_snp/ALT_1 0 MISSING_GT_snp/ALT_2 0 REF/MISSING_ENTRY_snp 0 REF/MISSING_GT_snp 0 REF/REF 13 REF/ALT_1 0 REF/ALT_2 0 ALT_1/MISSING_ENTRY_snp 0 ALT_1/MISSING_GT_snp 0 ALT_1/REF 0 ALT_1/ALT_1 90140 ALT_1/ALT_2 0 ALT_2/MISSING_ENTRY_snp 0 ALT_2/MISSING_GT_snp 0 ALT_2/REF 0 ALT_2/ALT_1 0 ALT_2/ALT_2 54455 ERROR 0
The summary by SnpSift also shows the type of agreement and/or disagreement (which we don't see because we're comparing the same file).
Now to compare two VCF files with different subsets of variants. The Perl one-liner creates (reproducible) variant subsets by generating a random number (using a seed) for each variant and only printing out the variant if the generated random number is greater than 0.5. Our two VCF files will have half the variants of the original file and will likely overlap 50% of the time.
# first VCF file bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz | perl -lane 'BEGIN {srand(31)} if (/^#/) { print } elsif (length($F[3]) == 1) { if (rand(1) > 0.5) {print} }' | bgzip > first.vcf.gz tabix -p vcf first.vcf.gz # second VCF file bcftools view -v snps NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz | perl -lane 'BEGIN {srand(1984)} if (/^#/) { print } elsif (length($F[3]) == 1) { if (rand(1) > 0.5) {print} }' | bgzip > second.vcf.gz tabix -p vcf second.vcf.gz # number of variants zcat first.vcf.gz| grep -v "^#" | wc -l 140879 zcat second.vcf.gz | grep -v "^#" | wc -l 140367
Perform the VCF comparisons using the various tools.
# intersect reports results with respect to -a bedtools intersect -u -a first.vcf.gz -b second.vcf.gz | wc -l 70446 # results differ when the second file is used as -a bedtools intersect -u -a second.vcf.gz -b first.vcf.gz | wc -l 70454 bedtools jaccard -a first.vcf.gz -b second.vcf.gz intersection union jaccard n_intersections 70367 210677 0.334004 70156 vcf-compare first.vcf.gz second.vcf.gz # This file was generated by vcf-compare. # The command line was: vcf-compare(v0.1.14-12-gcdb80b8) first.vcf.gz second.vcf.gz # #VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part. #VN The columns are: #VN 1 .. number of sites unique to this particular combination of files #VN 2- .. combination of files and space-separated number, a fraction of sites in the file VN 69895 second.vcf.gz (49.8%) VN 70367 first.vcf.gz (50.0%) second.vcf.gz (50.2%) VN 70415 first.vcf.gz (50.0%) #SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part. SN Number of REF matches: 70367 SN Number of ALT matches: 70209 SN Number of REF mismatches: 0 SN Number of ALT mismatches: 158 SN Number of samples in GT comparison: 0 # Number of sites lost due to grouping (e.g. duplicate sites): lost, %lost, read, reported, file SN Number of lost sites: 97 0.1% 140879 140782 first.vcf.gz SN Number of lost sites: 105 0.1% 140367 140262 second.vcf.gz bcftools isec first.vcf.gz second.vcf.gz -p first_second # records private to first.vcf.gz cat first_second/0000.vcf | grep -v "^#" | wc -l 70529 # records private to second.vcf.gz cat first_second/0001.vcf | grep -v "^#" | wc -l 70017 # records from first.vcf.gz shared by both first.vcf.gz second.vcf.gz cat first_second/0002.vcf | grep -v "^#" | wc -l 70350 # records from second.vcf.gz shared by both first.vcf.gz second.vcf.gz cat first_second/0003.vcf | grep -v "^#" | wc -l 70350 # unzip for SnpSift gunzip first.vcf.gz gunzip second.vcf.gz SnpSift concordance -v first.vcf second.vcf > snp_concordance.txt # output snipped 00:01:20 Writing concordance by sample to file 'concordance_first_second.by_sample.txt' 00:01:20 Writing summary file 'concordance_first_second.summary.txt' # Errors: ALT field does not match 93 cat concordance_first_second.by_sample.txt | ../script/transpose.pl sample NA12878 MISSING_ENTRY_first/MISSING_ENTRY_second 0 MISSING_ENTRY_first/MISSING_GT_second 33662 MISSING_ENTRY_first/REF 7 MISSING_ENTRY_first/ALT_1 22635 MISSING_ENTRY_first/ALT_2 13662 MISSING_GT_first/MISSING_ENTRY_second 34499 MISSING_GT_first/MISSING_GT_second 34248 MISSING_GT_first/REF 0 MISSING_GT_first/ALT_1 0 MISSING_GT_first/ALT_2 0 REF/MISSING_ENTRY_second 1 REF/MISSING_GT_second 0 REF/REF 4 REF/ALT_1 0 REF/ALT_2 0 ALT_1/MISSING_ENTRY_second 22462 ALT_1/MISSING_GT_second 0 ALT_1/REF 0 ALT_1/ALT_1 22457 ALT_1/ALT_2 0 ALT_2/MISSING_ENTRY_second 13516 ALT_2/MISSING_GT_second 0 ALT_2/REF 0 ALT_2/ALT_1 0 ALT_2/ALT_2 13599 ERROR 93
Summary
In this post I demonstrate the use of four different tools for comparing VCF files. Here's a short summary of each tool:
* BEDTools can be used to compare VCF files but only by comparing genomic coordinates; this can provide a quick answer to how many variants overlap and can be used to calculate a Jaccard index, indicating the amount of overall overlap
* vcf-compare provides additional statistics from BEDTools including the number of duplicate sites and Venn-Diagram Numbers, which show the number of exclusive variants in each respective VCF file
* bcftools isec also provides Venn-Diagram Numbers and additionally creates VCF files based on these intersections
* SnpSift concordance provides intersection counts as well as genotype differences between two VCF files; this is particularly useful for comparing variant calls from two different tools
Of note is that the numbers are slightly different between the different tools. I suspect that this may be due to the way each tool deals with duplicated sites. Furthermore, I have a section in a blog post (see section: "The Variant Call Format") that discusses how VCF files can be standardised so that they can actually be compared to each other.
Finally, I haven't written a blog post in a while but I will try to set aside some time for blogging.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi Dave Tang,
Thankyou for your wonderful and very informative blog. I tried the bcftools option you had provided in one of your blogs ( I got it through google search) and when I implemented on my samples I dont get a exact tally of total variants.
bcftools stats -s – my.vcf | grep -A 169 “Per-sample counts” > Persample_countsALL.txt
I see they dont tally when I add nhomRef+nhomALt+Het+nMissing
I get a couple numbers off from total counts.
I am wondering why and if it is missing some no calls? But that should be under the nMissing, isn it? Any advice would be appreciated!