Updated 2018 June 7th
Just recently, the genome Aggregation Database (gnomAD) VCF files were available for download:
The long-awaited gnomAD VCF is here - sites + frequencies for 123,136 exomes and 15,496 genomes: https://t.co/8puaTvJ45w pic.twitter.com/sxKOEVFDml
— Daniel MacArthur (@dgmacarthur) February 27, 2017
This blog post provides a bit more background on the project and how it differs from ExAC. The database is of immediate use to me as I can use it to filter out common variants, with the simple assumption that common variants should not be a causative mutation in monogenic disorders. In a previous post, I wondered how many pathogenic ClinVar variants were present in ExAC at an allele frequency (AF) greater than 0.01. The number was quite low. I wonder how many more pathogenic ClinVar variants will have an AF > 0.01 in the gnomAD. Once again, I'll use vcfanno to annotate the pathogenic ClinVar variants with the gnomAD VCF file.
To get started, download the gnomAD VCF file.
# 9G file wget -c https://storage.googleapis.com/gnomad-public/release-170228/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz # index tabix -p vcf gnomad.exomes.r2.0.1.sites.vcf.gz time zcat gnomad.exomes.r2.0.1.sites.vcf.gz | grep -v "^#" | wc -l 15014473 real 5m5.257s user 5m1.116s sys 1m12.432s time zcat ExAC.r0.3.1.sites.vep.vcf.gz | grep -v "^#" | wc -l 9362318 real 2m26.440s user 2m22.540s sys 0m37.480s
There are 5,652,155 more variants in the gnomAD. Now to run vcfanno.
cat conf.toml [[annotation]] file="gnomad.exomes.r2.0.1.sites.vcf.gz" fields = ["AF"] ops=["first"] # run vcfanno on 16 cores vcfanno -p 16 conf.toml clinvar_20170104_class_5.vcf.gz > clinvar_20170104_class_5_gnomad.vcf ============================================= vcfanno version 0.1.1-alpha [built with go1.8beta1] see: https://github.com/brentp/vcfanno ============================================= vcfanno.go:114: found 1 sources from 1 files bix.go:224: chromosome MT not found in gnomad.exomes.r2.0.1.sites.vcf.gz vcfanno.go:220: annotated 40651 variants in 46.67 seconds (871.0 / second) bgzip clinvar_20170104_class_5_gnomad.vcf tabix -p vcf clinvar_20170104_class_5_gnomad.vcf.gz # find variants that are at a high frequency bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%ID\t%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | sort -k6,6gr | head 12 121437382 A G rs1169305 0.9966 1 169519049 T C rs6025 0.9802 22 18901004 C T rs450046 0.9282 1 100672060 T C rs12021720 0.9179 12 122295335 T C rs1154510 0.8436 1 98348885 G A rs1801265 0.7709 7 150696111 T G rs1799983 0.7497 1 161599693 T C rs448740 0.6176 4 187158034 G A rs3733402 0.5385 11 18290859 C T rs1136743 0.4799 # total variants bcftools query -f '%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | wc -l 40651 # not in ExAC was 30928 # not in gnomAD bcftools query -f '%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | grep "^\.$" | wc -l 28435 # percentage not in ExAC ~76.1% # percentage not in gnomAD ~70.0% bc -l<<<28435/40651*100 69.94907874345034562400 bcftools query -f '%ID\t%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | grep -v "^\.$" > gnomad_af.txt cat gnomad_af.txt | wc -l 12216 # some variants have more than one frequency cat gnomad_af.txt | grep , | head -5 4.07e-06,8.1399e-06 0.0002294,0.00012187 1.219e-05,6.0949e-06 4.0665e-06,0 9.3694e-05,3.2589e-05,8.1473e-06 cat gnomad_af.txt | grep , | wc -l 2392 # split the AFs onto multiple lines cat gnomad_af.txt | tr ',' '\n' > temp && mv -f temp gnomad_af.txt cat gnomad_af.txt | wc -l 14896
As expected, a higher percentage of the pathogenic ClinVar variants are in the gnomAD. How many are have an AF > 0.01?
df <- read.table('gnomad_af.txt', header=FALSE, colClasses = 'numeric')
names(df) <- 'gnomad_af'
table(df$gnomad_af > 0.01)
FALSE  TRUE 
14800    96 
There are 3 additional pathogenic variants that have an AF > 0.01 (there were 93 using ExAC); below is a short script I wrote that extracts these pathogenic variants and saves them as a VCF file.
#!/usr/bin/env perl
use strict;
use warnings;
my $infile = "clinvar_20170104_class_5_gnomad.vcf.gz";
open(IN, '-|', "gunzip -c $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   print "$_\n" if /^#/;
   if (/;AF=(.*)$/){
      my @af = split(/,/, $1);
      my $c = 0;
      foreach my $af (@af){
         $c = 1 if $af > 0.01;
      }
      print "$_\n" if $c == 1;
   } else {
      next;
   }
}
close(IN);
exit(0);
Now to run the script and save the output.
get_var.pl > clinvar_20170104_class_5_gnomad_af_0.01.vcf
I have uploaded this VCF file to figshare, so please cite it if you use it for your work.
Summary
I had initially thought that there would be more variants with an AF > 0.01 but it doesn't look like it. How many possible exonic variants has the gnomAD captured? Using dbNSFP as an estimation, it's around 18% (15014473/83306036*100).
#dbNSFP time cat db*variant.chr* | grep -v "^#" | wc -l 83306036 real 2m6.005s user 0m24.484s sys 2m56.144s

This work is licensed under a Creative Commons
Attribution 4.0 International License.
                    
Dear Mr. Tang,
I read your blog post about path variants >1% in gnomAD with great interest.
Have you published this list of 96 anywhere? I’ve perused the literature, and have not been able to find such a list, however, it would be very helpful for general use by the genetics diagnostic community. It’s unlikely that this list would change dramatically over time, even as the databases get bigger.
Have you run the same query for likely pathogenic variants, or would you be willing to?
I am not familiar with vcfanno, or coding in general, otherwise I would be off to try this myself. Thanks for publishing your list of 96, and a Likely Path list too, if you are willing?
Thanks,
Renee
Hi Renee,
I have now uploaded the variants as a VCF file to figshare: https://doi.org/10.6084/m9.figshare.6455942.v1. I hope it is useful for your work.
Best,
Dave
HI are you able to repeat this analysis using the latest gnomAD v4.1 and ClinVar data and pull the allele frequencies for 2-4* variants in ClinVar (PractiseGuideline, Expert Panel, MultipleSubmitters_NoConflict). gnomAD Max AF (ancestry), gnomAD Overall AF, genome FAF and Exome FAF, gnomAD Allele Count, and gnomAD Hom / Hemi count. This will be of incredible value to the clinical community.
Thanks for the comment. I’ll see what I can do.