VCF concordance

I want to compare the genotype concordance between two VCF files and I came across SnpSift, which seems to calculate the statistics that I want. However, the format of the results from my run differ from the format in the documentation. In this post, I will try to come up with the exact scenarios that fall into each of the summary statistics (to satisfy my curiosity).

Firstly download the program:

# the latest version as of this post is snpEff_v4_1i_core.zip
wget http://downloads.sourceforge.net/project/snpeff/snpEff_latest_core.zip

To get started, I created a test VCF file (a.vcf) and made a copy of it (b.vcf).

cat a.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c
1       31      rs31    A       G       .       .       .       GT      0/0     0/0     0/0
cp a.vcf b.vcf

Running SnpSift concordance on a.vcf and b.vcf:

java -Xmx1g -jar \
/home/san/dtang/src/snpEff/SnpSift.jar concordance \
-v a.vcf b.vcf \
> concordance_a_b.by_variant.txt

To make the results easier to peruse, I'll use my transpose script.

cat concordance_a_b.by_variant.txt  | transpose
chr     1
pos     31
ref     A
alt     G
MISSING_ENTRY_a/MISSING_ENTRY_b 0
MISSING_ENTRY_a/MISSING_GT_b    0
MISSING_ENTRY_a/REF     0
MISSING_ENTRY_a/ALT_1   0
MISSING_ENTRY_a/ALT_2   0
MISSING_GT_a/MISSING_ENTRY_b    0
MISSING_GT_a/MISSING_GT_b       0
MISSING_GT_a/REF        0
MISSING_GT_a/ALT_1      0
MISSING_GT_a/ALT_2      0
REF/MISSING_ENTRY_b     0
REF/MISSING_GT_b        0
REF/REF 3
REF/ALT_1       0
REF/ALT_2       0
ALT_1/MISSING_ENTRY_b   0
ALT_1/MISSING_GT_b      0
ALT_1/REF       0
ALT_1/ALT_1     0
ALT_1/ALT_2     0
ALT_2/MISSING_ENTRY_b   0
ALT_2/MISSING_GT_b      0
ALT_2/REF       0
ALT_2/ALT_1     0
ALT_2/ALT_2     0
ERROR   0

The summary statistics are made up of pairwise combinations of five categories: MISSING_ENTRY, MISSING_GT, REF, ALT_1, and ALT_2. From the results, there are three counts of REF/REF, which I assume means that there were three matching genotypes between the two files and the genotypes were homozygous reference.

However, I am not sure what MISSING_ENTRY_a/MISSING_ENTRY_b constitutes. If an entry is missing in a.vcf and also missing in b.vcf, that means it doesn't exist right? Therefore, I'll start creating scenarios from the second summary statistic: MISSING_ENTRY_a/MISSING_GT_b. To save some space, I will not paste the code for SnpSift, which was run exactly in the manner described above.

MISSING_ENTRY_a/MISSING_GT_b

cat a.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c
1       31      .       A       G       .       .       .       GT      0/0     0/0     0/0

cat b.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c
1       32      .       A       G       .       .       .       GT      0/0     0/0     ./.

cat concordance_a_b.by_variant.txt | transpose
chr     1       1       1
pos     31      32      31
ref     A       A       A
alt     G       G       G
MISSING_ENTRY_a/MISSING_ENTRY_b 0       0       0
MISSING_ENTRY_a/MISSING_GT_b    0       1       0
MISSING_ENTRY_a/REF     0       2       0
MISSING_ENTRY_a/ALT_1   0       0       0
MISSING_ENTRY_a/ALT_2   0       0       0
MISSING_GT_a/MISSING_ENTRY_b    0       0       0
MISSING_GT_a/MISSING_GT_b       0       0       0
MISSING_GT_a/REF        0       0       0
MISSING_GT_a/ALT_1      0       0       0
MISSING_GT_a/ALT_2      0       0       0
REF/MISSING_ENTRY_b     3       0       3
REF/MISSING_GT_b        0       0       0
REF/REF 0       0       0
REF/ALT_1       0       0       0
REF/ALT_2       0       0       0
ALT_1/MISSING_ENTRY_b   0       0       0
ALT_1/MISSING_GT_b      0       0       0
ALT_1/REF       0       0       0
ALT_1/ALT_1     0       0       0
ALT_1/ALT_2     0       0       0
ALT_2/MISSING_ENTRY_b   0       0       0
ALT_2/MISSING_GT_b      0       0       0
ALT_2/REF       0       0       0
ALT_2/ALT_1     0       0       0
ALT_2/ALT_2     0       0       0
ERROR   0       0       0

From the above example, we also created the MISSING_ENTRY_a/REF scenario, where an entry is in b (SNP at pos 32) but not a and the genotypes from this missing entry are homozygous reference. We also created the REF/MISSING_ENTRY_b scenario, where an entry is in a (SNP at pos 31) but not in b and the genotypes from this missing entry are homozygous reference.

MISSING_ENTRY_a/ALT_1

Having realised that I can shorten this post substantially by creating multiple scenarios in one example, here are the next two VCF files:

cat a.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c
1       31      .       A       G       .       .       .       GT      0/0     0/1     1/1

cat b.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c
1       32      .       A       G       .       .       .       GT      0/0     0/1     1/1

cat concordance_a_b.by_variant.txt | transpose 
chr     1       1       1
pos     31      32      31
ref     A       A       A
alt     G       G       G
MISSING_ENTRY_a/MISSING_ENTRY_b 0       0       0
MISSING_ENTRY_a/MISSING_GT_b    0       0       0
MISSING_ENTRY_a/REF     0       1       0
MISSING_ENTRY_a/ALT_1   0       1       0
MISSING_ENTRY_a/ALT_2   0       1       0
MISSING_GT_a/MISSING_ENTRY_b    0       0       0
MISSING_GT_a/MISSING_GT_b       0       0       0
MISSING_GT_a/REF        0       0       0
MISSING_GT_a/ALT_1      0       0       0
MISSING_GT_a/ALT_2      0       0       0
REF/MISSING_ENTRY_b     1       0       1
REF/MISSING_GT_b        0       0       0
REF/REF 0       0       0
REF/ALT_1       0       0       0
REF/ALT_2       0       0       0
ALT_1/MISSING_ENTRY_b   1       0       1
ALT_1/MISSING_GT_b      0       0       0
ALT_1/REF       0       0       0
ALT_1/ALT_1     0       0       0
ALT_1/ALT_2     0       0       0
ALT_2/MISSING_ENTRY_b   1       0       1
ALT_2/MISSING_GT_b      0       0       0
ALT_2/REF       0       0       0
ALT_2/ALT_1     0       0       0
ALT_2/ALT_2     0       0       0
ERROR   0       0       0

The three statistics MISSING_ENTRY_a/REF, MISSING_ENTRY_a/ALT_1, and MISSING_ENTRY_a/ALT_2 are now covered. And so are the other three statistics REF/MISSING_ENTRY_b, ALT_1/MISSING_ENTRY_b, and ALT_2/MISSING_ENTRY_b.

MISSING_GT_a/MISSING_ENTRY_b

Here are the next two VCF files:

cat a.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c
1       31      .       A       G       .       .       .       GT      0/0     0/1     ./.

cat b.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c
1       32      .       A       G       .       .       .       GT      0/0     0/1     ./.

cat concordance_a_b.by_variant.txt | transpose
chr     1       1       1
pos     31      32      31
ref     A       A       A
alt     G       G       G
MISSING_ENTRY_a/MISSING_ENTRY_b 0       0       0
MISSING_ENTRY_a/MISSING_GT_b    0       1       0
MISSING_ENTRY_a/REF     0       1       0
MISSING_ENTRY_a/ALT_1   0       1       0
MISSING_ENTRY_a/ALT_2   0       0       0
MISSING_GT_a/MISSING_ENTRY_b    1       0       1
MISSING_GT_a/MISSING_GT_b       0       0       0
MISSING_GT_a/REF        0       0       0
MISSING_GT_a/ALT_1      0       0       0
MISSING_GT_a/ALT_2      0       0       0
REF/MISSING_ENTRY_b     1       0       1
REF/MISSING_GT_b        0       0       0
REF/REF 0       0       0
REF/ALT_1       0       0       0
REF/ALT_2       0       0       0
ALT_1/MISSING_ENTRY_b   1       0       1
ALT_1/MISSING_GT_b      0       0       0
ALT_1/REF       0       0       0
ALT_1/ALT_1     0       0       0
ALT_1/ALT_2     0       0       0
ALT_2/MISSING_ENTRY_b   0       0       0
ALT_2/MISSING_GT_b      0       0       0
ALT_2/REF       0       0       0
ALT_2/ALT_1     0       0       0
ALT_2/ALT_2     0       0       0
ERROR   0       0       0

MISSING_GT_a/MISSING_GT_b

While coming up with this scenario, I had another epiphany; if I add another sample, I can kill 4 birds with one stone.

cat a.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c       d
1       31      .       A       G       .       .       .       GT      ./.     ./.     ./.     ./.

cat b.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c       d
1       31      .       A       G       .       .       .       GT      0/0     0/1     1/1     ./.

cat concordance_a_b.by_variant.txt | transpose
chr     1
pos     31
ref     A
alt     G
MISSING_ENTRY_a/MISSING_ENTRY_b 0
MISSING_ENTRY_a/MISSING_GT_b    0
MISSING_ENTRY_a/REF     0
MISSING_ENTRY_a/ALT_1   0
MISSING_ENTRY_a/ALT_2   0
MISSING_GT_a/MISSING_ENTRY_b    0
MISSING_GT_a/MISSING_GT_b       1
MISSING_GT_a/REF        1
MISSING_GT_a/ALT_1      1
MISSING_GT_a/ALT_2      1
REF/MISSING_ENTRY_b     0
REF/MISSING_GT_b        0
REF/REF 0
REF/ALT_1       0
REF/ALT_2       0
ALT_1/MISSING_ENTRY_b   0
ALT_1/MISSING_GT_b      0
ALT_1/REF       0
ALT_1/ALT_1     0
ALT_1/ALT_2     0
ALT_2/MISSING_ENTRY_b   0
ALT_2/MISSING_GT_b      0
ALT_2/REF       0
ALT_2/ALT_1     0
ALT_2/ALT_2     0
ERROR   0

Now we have covered MISSING_GT_a/MISSING_GT_b, MISSING_GT_a/REF, MISSING_GT_a/ALT_1, and MISSING_GT_a/ALT_2.

REF/MISSING_GT_b

Here are the VCF files:

cat a.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c       d
1       31      .       A       G       .       .       .       GT      0/0     0/0     0/0     0/0

cat b.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a       b       c       d
1       31      .       A       G       .       .       .       GT      0/0     0/1     1/1     ./.

cat concordance_a_b.by_variant.txt | transpose
chr     1
pos     31
ref     A
alt     G
MISSING_ENTRY_a/MISSING_ENTRY_b 0
MISSING_ENTRY_a/MISSING_GT_b    0
MISSING_ENTRY_a/REF     0
MISSING_ENTRY_a/ALT_1   0
MISSING_ENTRY_a/ALT_2   0
MISSING_GT_a/MISSING_ENTRY_b    0
MISSING_GT_a/MISSING_GT_b       0
MISSING_GT_a/REF        0
MISSING_GT_a/ALT_1      0
MISSING_GT_a/ALT_2      0
REF/MISSING_ENTRY_b     0
REF/MISSING_GT_b        1
REF/REF 1
REF/ALT_1       1
REF/ALT_2       1
ALT_1/MISSING_ENTRY_b   0
ALT_1/MISSING_GT_b      0
ALT_1/REF       0
ALT_1/ALT_1     0
ALT_1/ALT_2     0
ALT_2/MISSING_ENTRY_b   0
ALT_2/MISSING_GT_b      0
ALT_2/REF       0
ALT_2/ALT_1     0
ALT_2/ALT_2     0
ERROR   0

Now that you (hopefully) see where this is going, I'm not going to bother with the ALT_1/etc. and the ALT_2/etc. examples. I'll finish this post with the ERROR statistic:

cat a.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a
1       31      .       A       G       .       .       .       GT      0/0
1       32      .       C       T       .       .       .       GT      0/0

cat b.vcf 
##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a
1       31      .       G       A       .       .       .       GT      0/0
1       32      .       T       C       .       .       .       GT      0/0

cat concordance_a_b.by_variant.txt | transpose
chr     1       1
pos     31      32
ref     A       C
alt     G       T
MISSING_ENTRY_a/MISSING_ENTRY_b 0       0
MISSING_ENTRY_a/MISSING_GT_b    0       0
MISSING_ENTRY_a/REF     0       0
MISSING_ENTRY_a/ALT_1   0       0
MISSING_ENTRY_a/ALT_2   0       0
MISSING_GT_a/MISSING_ENTRY_b    0       0
MISSING_GT_a/MISSING_GT_b       0       0
MISSING_GT_a/REF        0       0
MISSING_GT_a/ALT_1      0       0
MISSING_GT_a/ALT_2      0       0
REF/MISSING_ENTRY_b     0       0
REF/MISSING_GT_b        0       0
REF/REF 0       0
REF/ALT_1       0       0
REF/ALT_2       0       0
ALT_1/MISSING_ENTRY_b   0       0
ALT_1/MISSING_GT_b      0       0
ALT_1/REF       0       0
ALT_1/ALT_1     0       0
ALT_1/ALT_2     0       0
ALT_2/MISSING_ENTRY_b   0       0
ALT_2/MISSING_GT_b      0       0
ALT_2/REF       0       0
ALT_2/ALT_1     0       0
ALT_2/ALT_2     0       0
ERROR   1       1
ERROR_ALT_DOES_NOT_MATCH:G/A    ERROR_ALT_DOES_NOT_MATCH:T/C

Summary

This post was on calculating the VCF concordance between two VCF files. I found a program called SnpSift, which could calculate the statistics that I wanted, however the results generated from the latest version is not documented (but quite self-explanatory). Therefore, to satisfy myself I created various test cases that would generate the various statistics calculated by the program. However, after all of this, I am still not sure how I can create an example that falls into the MISSING_ENTRY_a/MISSING_ENTRY_b category (perhaps this statistic was included as a sanity check? From my actual data, it always has been a zero).

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.