I want to compare the genotype concordance between two VCF files and I came across SnpSift, which seems to calculate the statistics that I want. However, the format of the results from my run differ from the format in the documentation. In this post, I will try to come up with the exact scenarios that fall into each of the summary statistics (to satisfy my curiosity).
Firstly download the program:
# the latest version as of this post is snpEff_v4_1i_core.zip wget http://downloads.sourceforge.net/project/snpeff/snpEff_latest_core.zip
To get started, I created a test VCF file (a.vcf) and made a copy of it (b.vcf).
cat a.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c 1 31 rs31 A G . . . GT 0/0 0/0 0/0 cp a.vcf b.vcf
Running SnpSift concordance on a.vcf and b.vcf:
java -Xmx1g -jar \ /home/san/dtang/src/snpEff/SnpSift.jar concordance \ -v a.vcf b.vcf \ > concordance_a_b.by_variant.txt
To make the results easier to peruse, I'll use my transpose script.
cat concordance_a_b.by_variant.txt | transpose chr 1 pos 31 ref A alt G MISSING_ENTRY_a/MISSING_ENTRY_b 0 MISSING_ENTRY_a/MISSING_GT_b 0 MISSING_ENTRY_a/REF 0 MISSING_ENTRY_a/ALT_1 0 MISSING_ENTRY_a/ALT_2 0 MISSING_GT_a/MISSING_ENTRY_b 0 MISSING_GT_a/MISSING_GT_b 0 MISSING_GT_a/REF 0 MISSING_GT_a/ALT_1 0 MISSING_GT_a/ALT_2 0 REF/MISSING_ENTRY_b 0 REF/MISSING_GT_b 0 REF/REF 3 REF/ALT_1 0 REF/ALT_2 0 ALT_1/MISSING_ENTRY_b 0 ALT_1/MISSING_GT_b 0 ALT_1/REF 0 ALT_1/ALT_1 0 ALT_1/ALT_2 0 ALT_2/MISSING_ENTRY_b 0 ALT_2/MISSING_GT_b 0 ALT_2/REF 0 ALT_2/ALT_1 0 ALT_2/ALT_2 0 ERROR 0
The summary statistics are made up of pairwise combinations of five categories: MISSING_ENTRY, MISSING_GT, REF, ALT_1, and ALT_2. From the results, there are three counts of REF/REF, which I assume means that there were three matching genotypes between the two files and the genotypes were homozygous reference.
However, I am not sure what MISSING_ENTRY_a/MISSING_ENTRY_b constitutes. If an entry is missing in a.vcf and also missing in b.vcf, that means it doesn't exist right? Therefore, I'll start creating scenarios from the second summary statistic: MISSING_ENTRY_a/MISSING_GT_b. To save some space, I will not paste the code for SnpSift, which was run exactly in the manner described above.
MISSING_ENTRY_a/MISSING_GT_b
cat a.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c 1 31 . A G . . . GT 0/0 0/0 0/0 cat b.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c 1 32 . A G . . . GT 0/0 0/0 ./. cat concordance_a_b.by_variant.txt | transpose chr 1 1 1 pos 31 32 31 ref A A A alt G G G MISSING_ENTRY_a/MISSING_ENTRY_b 0 0 0 MISSING_ENTRY_a/MISSING_GT_b 0 1 0 MISSING_ENTRY_a/REF 0 2 0 MISSING_ENTRY_a/ALT_1 0 0 0 MISSING_ENTRY_a/ALT_2 0 0 0 MISSING_GT_a/MISSING_ENTRY_b 0 0 0 MISSING_GT_a/MISSING_GT_b 0 0 0 MISSING_GT_a/REF 0 0 0 MISSING_GT_a/ALT_1 0 0 0 MISSING_GT_a/ALT_2 0 0 0 REF/MISSING_ENTRY_b 3 0 3 REF/MISSING_GT_b 0 0 0 REF/REF 0 0 0 REF/ALT_1 0 0 0 REF/ALT_2 0 0 0 ALT_1/MISSING_ENTRY_b 0 0 0 ALT_1/MISSING_GT_b 0 0 0 ALT_1/REF 0 0 0 ALT_1/ALT_1 0 0 0 ALT_1/ALT_2 0 0 0 ALT_2/MISSING_ENTRY_b 0 0 0 ALT_2/MISSING_GT_b 0 0 0 ALT_2/REF 0 0 0 ALT_2/ALT_1 0 0 0 ALT_2/ALT_2 0 0 0 ERROR 0 0 0
From the above example, we also created the MISSING_ENTRY_a/REF scenario, where an entry is in b (SNP at pos 32) but not a and the genotypes from this missing entry are homozygous reference. We also created the REF/MISSING_ENTRY_b scenario, where an entry is in a (SNP at pos 31) but not in b and the genotypes from this missing entry are homozygous reference.
MISSING_ENTRY_a/ALT_1
Having realised that I can shorten this post substantially by creating multiple scenarios in one example, here are the next two VCF files:
cat a.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c 1 31 . A G . . . GT 0/0 0/1 1/1 cat b.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c 1 32 . A G . . . GT 0/0 0/1 1/1 cat concordance_a_b.by_variant.txt | transpose chr 1 1 1 pos 31 32 31 ref A A A alt G G G MISSING_ENTRY_a/MISSING_ENTRY_b 0 0 0 MISSING_ENTRY_a/MISSING_GT_b 0 0 0 MISSING_ENTRY_a/REF 0 1 0 MISSING_ENTRY_a/ALT_1 0 1 0 MISSING_ENTRY_a/ALT_2 0 1 0 MISSING_GT_a/MISSING_ENTRY_b 0 0 0 MISSING_GT_a/MISSING_GT_b 0 0 0 MISSING_GT_a/REF 0 0 0 MISSING_GT_a/ALT_1 0 0 0 MISSING_GT_a/ALT_2 0 0 0 REF/MISSING_ENTRY_b 1 0 1 REF/MISSING_GT_b 0 0 0 REF/REF 0 0 0 REF/ALT_1 0 0 0 REF/ALT_2 0 0 0 ALT_1/MISSING_ENTRY_b 1 0 1 ALT_1/MISSING_GT_b 0 0 0 ALT_1/REF 0 0 0 ALT_1/ALT_1 0 0 0 ALT_1/ALT_2 0 0 0 ALT_2/MISSING_ENTRY_b 1 0 1 ALT_2/MISSING_GT_b 0 0 0 ALT_2/REF 0 0 0 ALT_2/ALT_1 0 0 0 ALT_2/ALT_2 0 0 0 ERROR 0 0 0
The three statistics MISSING_ENTRY_a/REF, MISSING_ENTRY_a/ALT_1, and MISSING_ENTRY_a/ALT_2 are now covered. And so are the other three statistics REF/MISSING_ENTRY_b, ALT_1/MISSING_ENTRY_b, and ALT_2/MISSING_ENTRY_b.
MISSING_GT_a/MISSING_ENTRY_b
Here are the next two VCF files:
cat a.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c 1 31 . A G . . . GT 0/0 0/1 ./. cat b.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c 1 32 . A G . . . GT 0/0 0/1 ./. cat concordance_a_b.by_variant.txt | transpose chr 1 1 1 pos 31 32 31 ref A A A alt G G G MISSING_ENTRY_a/MISSING_ENTRY_b 0 0 0 MISSING_ENTRY_a/MISSING_GT_b 0 1 0 MISSING_ENTRY_a/REF 0 1 0 MISSING_ENTRY_a/ALT_1 0 1 0 MISSING_ENTRY_a/ALT_2 0 0 0 MISSING_GT_a/MISSING_ENTRY_b 1 0 1 MISSING_GT_a/MISSING_GT_b 0 0 0 MISSING_GT_a/REF 0 0 0 MISSING_GT_a/ALT_1 0 0 0 MISSING_GT_a/ALT_2 0 0 0 REF/MISSING_ENTRY_b 1 0 1 REF/MISSING_GT_b 0 0 0 REF/REF 0 0 0 REF/ALT_1 0 0 0 REF/ALT_2 0 0 0 ALT_1/MISSING_ENTRY_b 1 0 1 ALT_1/MISSING_GT_b 0 0 0 ALT_1/REF 0 0 0 ALT_1/ALT_1 0 0 0 ALT_1/ALT_2 0 0 0 ALT_2/MISSING_ENTRY_b 0 0 0 ALT_2/MISSING_GT_b 0 0 0 ALT_2/REF 0 0 0 ALT_2/ALT_1 0 0 0 ALT_2/ALT_2 0 0 0 ERROR 0 0 0
MISSING_GT_a/MISSING_GT_b
While coming up with this scenario, I had another epiphany; if I add another sample, I can kill 4 birds with one stone.
cat a.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c d 1 31 . A G . . . GT ./. ./. ./. ./. cat b.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c d 1 31 . A G . . . GT 0/0 0/1 1/1 ./. cat concordance_a_b.by_variant.txt | transpose chr 1 pos 31 ref A alt G MISSING_ENTRY_a/MISSING_ENTRY_b 0 MISSING_ENTRY_a/MISSING_GT_b 0 MISSING_ENTRY_a/REF 0 MISSING_ENTRY_a/ALT_1 0 MISSING_ENTRY_a/ALT_2 0 MISSING_GT_a/MISSING_ENTRY_b 0 MISSING_GT_a/MISSING_GT_b 1 MISSING_GT_a/REF 1 MISSING_GT_a/ALT_1 1 MISSING_GT_a/ALT_2 1 REF/MISSING_ENTRY_b 0 REF/MISSING_GT_b 0 REF/REF 0 REF/ALT_1 0 REF/ALT_2 0 ALT_1/MISSING_ENTRY_b 0 ALT_1/MISSING_GT_b 0 ALT_1/REF 0 ALT_1/ALT_1 0 ALT_1/ALT_2 0 ALT_2/MISSING_ENTRY_b 0 ALT_2/MISSING_GT_b 0 ALT_2/REF 0 ALT_2/ALT_1 0 ALT_2/ALT_2 0 ERROR 0
Now we have covered MISSING_GT_a/MISSING_GT_b, MISSING_GT_a/REF, MISSING_GT_a/ALT_1, and MISSING_GT_a/ALT_2.
REF/MISSING_GT_b
Here are the VCF files:
cat a.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c d 1 31 . A G . . . GT 0/0 0/0 0/0 0/0 cat b.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b c d 1 31 . A G . . . GT 0/0 0/1 1/1 ./. cat concordance_a_b.by_variant.txt | transpose chr 1 pos 31 ref A alt G MISSING_ENTRY_a/MISSING_ENTRY_b 0 MISSING_ENTRY_a/MISSING_GT_b 0 MISSING_ENTRY_a/REF 0 MISSING_ENTRY_a/ALT_1 0 MISSING_ENTRY_a/ALT_2 0 MISSING_GT_a/MISSING_ENTRY_b 0 MISSING_GT_a/MISSING_GT_b 0 MISSING_GT_a/REF 0 MISSING_GT_a/ALT_1 0 MISSING_GT_a/ALT_2 0 REF/MISSING_ENTRY_b 0 REF/MISSING_GT_b 1 REF/REF 1 REF/ALT_1 1 REF/ALT_2 1 ALT_1/MISSING_ENTRY_b 0 ALT_1/MISSING_GT_b 0 ALT_1/REF 0 ALT_1/ALT_1 0 ALT_1/ALT_2 0 ALT_2/MISSING_ENTRY_b 0 ALT_2/MISSING_GT_b 0 ALT_2/REF 0 ALT_2/ALT_1 0 ALT_2/ALT_2 0 ERROR 0
Now that you (hopefully) see where this is going, I'm not going to bother with the ALT_1/etc. and the ALT_2/etc. examples. I'll finish this post with the ERROR statistic:
cat a.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a 1 31 . A G . . . GT 0/0 1 32 . C T . . . GT 0/0 cat b.vcf ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a 1 31 . G A . . . GT 0/0 1 32 . T C . . . GT 0/0 cat concordance_a_b.by_variant.txt | transpose chr 1 1 pos 31 32 ref A C alt G T MISSING_ENTRY_a/MISSING_ENTRY_b 0 0 MISSING_ENTRY_a/MISSING_GT_b 0 0 MISSING_ENTRY_a/REF 0 0 MISSING_ENTRY_a/ALT_1 0 0 MISSING_ENTRY_a/ALT_2 0 0 MISSING_GT_a/MISSING_ENTRY_b 0 0 MISSING_GT_a/MISSING_GT_b 0 0 MISSING_GT_a/REF 0 0 MISSING_GT_a/ALT_1 0 0 MISSING_GT_a/ALT_2 0 0 REF/MISSING_ENTRY_b 0 0 REF/MISSING_GT_b 0 0 REF/REF 0 0 REF/ALT_1 0 0 REF/ALT_2 0 0 ALT_1/MISSING_ENTRY_b 0 0 ALT_1/MISSING_GT_b 0 0 ALT_1/REF 0 0 ALT_1/ALT_1 0 0 ALT_1/ALT_2 0 0 ALT_2/MISSING_ENTRY_b 0 0 ALT_2/MISSING_GT_b 0 0 ALT_2/REF 0 0 ALT_2/ALT_1 0 0 ALT_2/ALT_2 0 0 ERROR 1 1 ERROR_ALT_DOES_NOT_MATCH:G/A ERROR_ALT_DOES_NOT_MATCH:T/C
Summary
This post was on calculating the VCF concordance between two VCF files. I found a program called SnpSift, which could calculate the statistics that I wanted, however the results generated from the latest version is not documented (but quite self-explanatory). Therefore, to satisfy myself I created various test cases that would generate the various statistics calculated by the program. However, after all of this, I am still not sure how I can create an example that falls into the MISSING_ENTRY_a/MISSING_ENTRY_b category (perhaps this statistic was included as a sanity check? From my actual data, it always has been a zero).
This work is licensed under a Creative Commons
Attribution 4.0 International License.