gnomAD allele frequency of pathogenic ClinVar variants

Updated 2018 June 7th

Just recently, the genome Aggregation Database (gnomAD) VCF files were available for download:

This blog post provides a bit more background on the project and how it differs from ExAC. The database is of immediate use to me as I can use it to filter out common variants, with the simple assumption that common variants should not be a causative mutation in monogenic disorders. In a previous post, I wondered how many pathogenic ClinVar variants were present in ExAC at an allele frequency (AF) greater than 0.01. The number was quite low. I wonder how many more pathogenic ClinVar variants will have an AF > 0.01 in the gnomAD. Once again, I’ll use vcfanno to annotate the pathogenic ClinVar variants with the gnomAD VCF file.

To get started, download the gnomAD VCF file.

# 9G file
wget -c
# index
tabix -p vcf gnomad.exomes.r2.0.1.sites.vcf.gz

time zcat gnomad.exomes.r2.0.1.sites.vcf.gz | grep -v "^#" | wc -l

real    5m5.257s
user    5m1.116s
sys     1m12.432s

time zcat ExAC.r0.3.1.sites.vep.vcf.gz | grep -v "^#" | wc -l

real    2m26.440s
user    2m22.540s
sys     0m37.480s

There are 5,652,155 more variants in the gnomAD. Now to run vcfanno.

cat conf.toml 
fields = ["AF"]

# run vcfanno on 16 cores
vcfanno -p 16 conf.toml clinvar_20170104_class_5.vcf.gz > clinvar_20170104_class_5_gnomad.vcf

vcfanno version 0.1.1-alpha [built with go1.8beta1]

vcfanno.go:114: found 1 sources from 1 files
bix.go:224: chromosome MT not found in gnomad.exomes.r2.0.1.sites.vcf.gz
vcfanno.go:220: annotated 40651 variants in 46.67 seconds (871.0 / second)

bgzip clinvar_20170104_class_5_gnomad.vcf
tabix -p vcf clinvar_20170104_class_5_gnomad.vcf.gz

# find variants that are at a high frequency
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%ID\t%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | sort -k6,6gr | head
12      121437382       A       G       rs1169305       0.9966
1       169519049       T       C       rs6025  0.9802
22      18901004        C       T       rs450046        0.9282
1       100672060       T       C       rs12021720      0.9179
12      122295335       T       C       rs1154510       0.8436
1       98348885        G       A       rs1801265       0.7709
7       150696111       T       G       rs1799983       0.7497
1       161599693       T       C       rs448740        0.6176
4       187158034       G       A       rs3733402       0.5385
11      18290859        C       T       rs1136743       0.4799

# total variants
bcftools query -f '%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | wc -l

# not in ExAC was 30928
# not in gnomAD
bcftools query -f '%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | grep "^\.$" | wc -l

# percentage not in ExAC ~76.1%
# percentage not in gnomAD ~70.0%
bc -l<<<28435/40651*100

bcftools query -f '%ID\t%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | grep -v "^\.$" > gnomad_af.txt
cat gnomad_af.txt | wc -l

# some variants have more than one frequency
cat gnomad_af.txt | grep , | head -5

cat gnomad_af.txt | grep , | wc -l

# split the AFs onto multiple lines
cat gnomad_af.txt | tr ',' '\n' > temp && mv -f temp gnomad_af.txt
cat gnomad_af.txt | wc -l

As expected, a higher percentage of the pathogenic ClinVar variants are in the gnomAD. How many are have an AF > 0.01?

df <- read.table('gnomad_af.txt', header=FALSE, colClasses = 'numeric')
names(df) <- 'gnomad_af'

table(df$gnomad_af > 0.01)

14800    96 

There are 3 additional pathogenic variants that have an AF > 0.01 (there were 93 using ExAC); below is a short script I wrote that extracts these pathogenic variants and saves them as a VCF file.

#!/usr/bin/env perl

use strict;
use warnings;

my $infile = "clinvar_20170104_class_5_gnomad.vcf.gz";

open(IN, '-|', "gunzip -c $infile") || die "Could not open $infile: $!\n";

   print "$_\n" if /^#/;
   if (/;AF=(.*)$/){
      my @af = split(/,/, $1);
      my $c = 0;
      foreach my $af (@af){
         $c = 1 if $af > 0.01;
      print "$_\n" if $c == 1;
   } else {


Now to run the script and save the output. > clinvar_20170104_class_5_gnomad_af_0.01.vcf

I have uploaded this VCF file to figshare, so please cite it if you use it for your work.


I had initially thought that there would be more variants with an AF > 0.01 but it doesn’t look like it. How many possible exonic variants has the gnomAD captured? Using dbNSFP as an estimation, it’s around 18% (15014473/83306036*100).

time cat db*variant.chr* | grep -v "^#" | wc -l

real    2m6.005s
user    0m24.484s
sys     2m56.144s
Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
2 comments Add yours
  1. Dear Mr. Tang,

    I read your blog post about path variants >1% in gnomAD with great interest.

    Have you published this list of 96 anywhere? I’ve perused the literature, and have not been able to find such a list, however, it would be very helpful for general use by the genetics diagnostic community. It’s unlikely that this list would change dramatically over time, even as the databases get bigger.

    Have you run the same query for likely pathogenic variants, or would you be willing to?

    I am not familiar with vcfanno, or coding in general, otherwise I would be off to try this myself. Thanks for publishing your list of 96, and a Likely Path list too, if you are willing?


Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.