gnomAD allele frequency of pathogenic ClinVar variants

Just recently, the genome Aggregation Database (gnomAD) VCF files were available for download:

This blog post provides a bit more background on the project and how it differs from ExAC. The database is of immediate use to me as I can use it to filter out common variants, with the simple assumption that common variants should not be a causative mutation in monogenic disorders. In a previous post, I wondered how many pathogenic ClinVar variants were present in ExAC at an allele frequency (AF) greater than 0.01. The number was quite low. I wonder how many more pathogenic ClinVar variants will have an AF > 0.01 in the gnomAD. Once again, I'll use vcfanno to annotate the pathogenic ClinVar variants with the gnomAD VCF file.

To get started, download the gnomAD VCF file.

# 9G file
wget -c https://storage.googleapis.com/gnomad-public/release-170228/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz
# index
tabix -p vcf gnomad.exomes.r2.0.1.sites.vcf.gz

time zcat gnomad.exomes.r2.0.1.sites.vcf.gz | grep -v "^#" | wc -l
15014473

real    5m5.257s
user    5m1.116s
sys     1m12.432s

time zcat ExAC.r0.3.1.sites.vep.vcf.gz | grep -v "^#" | wc -l
9362318

real    2m26.440s
user    2m22.540s
sys     0m37.480s

There are 5,652,155 more variants in the gnomAD. Now to run vcfanno.

cat conf.toml 
[[annotation]]
file="gnomad.exomes.r2.0.1.sites.vcf.gz"
fields = ["AF"]
ops=["first"]

# run vcfanno on 16 cores
vcfanno -p 16 conf.toml clinvar_20170104_class_5.vcf.gz > clinvar_20170104_class_5_gnomad.vcf

=============================================
vcfanno version 0.1.1-alpha [built with go1.8beta1]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:114: found 1 sources from 1 files
bix.go:224: chromosome MT not found in gnomad.exomes.r2.0.1.sites.vcf.gz
vcfanno.go:220: annotated 40651 variants in 46.67 seconds (871.0 / second)

bgzip clinvar_20170104_class_5_gnomad.vcf
tabix -p vcf clinvar_20170104_class_5_gnomad.vcf.gz

# find variants that are at a high frequency
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%ID\t%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | sort -k6,6gr | head
12      121437382       A       G       rs1169305       0.9966
1       169519049       T       C       rs6025  0.9802
22      18901004        C       T       rs450046        0.9282
1       100672060       T       C       rs12021720      0.9179
12      122295335       T       C       rs1154510       0.8436
1       98348885        G       A       rs1801265       0.7709
7       150696111       T       G       rs1799983       0.7497
1       161599693       T       C       rs448740        0.6176
4       187158034       G       A       rs3733402       0.5385
11      18290859        C       T       rs1136743       0.4799

# total variants
bcftools query -f '%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | wc -l
40651

# not in ExAC was 30928
# not in gnomAD
bcftools query -f '%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | grep "^\.$" | wc -l
28435

# percentage not in ExAC ~76.1%
# percentage not in gnomAD ~70.0%
bc -l<<<28435/40651*100
69.94907874345034562400

bcftools query -f '%ID\t%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | grep -v "^\.$" > gnomad_af.txt
cat gnomad_af.txt | wc -l
12216

# some variants have more than one frequency
cat gnomad_af.txt | grep , | head -5
4.07e-06,8.1399e-06
0.0002294,0.00012187
1.219e-05,6.0949e-06
4.0665e-06,0
9.3694e-05,3.2589e-05,8.1473e-06

cat gnomad_af.txt | grep , | wc -l
2392

# split the AFs onto multiple lines
cat gnomad_af.txt | tr ',' '\n' > temp && mv -f temp gnomad_af.txt
cat gnomad_af.txt | wc -l
14896

As expected, a higher percentage of the pathogenic ClinVar variants are in the gnomAD. How many are have an AF > 0.01?

df <- read.table('gnomad_af.txt', header=FALSE, colClasses = 'numeric')
names(df) <- 'gnomad_af'

table(df$gnomad_af > 0.01)

FALSE  TRUE 
14800    96 

There are 3 additional pathogenic variants that have an AF > 0.01 (there were 93 using ExAC).

Summary

I had initially thought that there would be more variants with an AF > 0.01 but it doesn't look like it. How many possible exonic variants has the gnomAD captured? Using dbNSFP as an estimation, it's around 18% (15014473/83306036*100).

#dbNSFP
time cat db*variant.chr* | grep -v "^#" | wc -l
83306036

real    2m6.005s
user    0m24.484s
sys     2m56.144s



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.