Just recently, the genome Aggregation Database (gnomAD) VCF files were available for download:
— Daniel MacArthur (@dgmacarthur) February 27, 2017
This blog post provides a bit more background on the project and how it differs from ExAC. The database is of immediate use to me as I can use it to filter out common variants, with the simple assumption that common variants should not be a causative mutation in monogenic disorders. In a previous post, I wondered how many pathogenic ClinVar variants were present in ExAC at an allele frequency (AF) greater than 0.01. The number was quite low. I wonder how many more pathogenic ClinVar variants will have an AF > 0.01 in the gnomAD. Once again, I'll use vcfanno to annotate the pathogenic ClinVar variants with the gnomAD VCF file.
To get started, download the gnomAD VCF file.
# 9G file wget -c https://storage.googleapis.com/gnomad-public/release-170228/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz # index tabix -p vcf gnomad.exomes.r2.0.1.sites.vcf.gz time zcat gnomad.exomes.r2.0.1.sites.vcf.gz | grep -v "^#" | wc -l 15014473 real 5m5.257s user 5m1.116s sys 1m12.432s time zcat ExAC.r0.3.1.sites.vep.vcf.gz | grep -v "^#" | wc -l 9362318 real 2m26.440s user 2m22.540s sys 0m37.480s
There are 5,652,155 more variants in the gnomAD. Now to run vcfanno.
cat conf.toml [[annotation]] file="gnomad.exomes.r2.0.1.sites.vcf.gz" fields = ["AF"] ops=["first"] # run vcfanno on 16 cores vcfanno -p 16 conf.toml clinvar_20170104_class_5.vcf.gz > clinvar_20170104_class_5_gnomad.vcf ============================================= vcfanno version 0.1.1-alpha [built with go1.8beta1] see: https://github.com/brentp/vcfanno ============================================= vcfanno.go:114: found 1 sources from 1 files bix.go:224: chromosome MT not found in gnomad.exomes.r2.0.1.sites.vcf.gz vcfanno.go:220: annotated 40651 variants in 46.67 seconds (871.0 / second) bgzip clinvar_20170104_class_5_gnomad.vcf tabix -p vcf clinvar_20170104_class_5_gnomad.vcf.gz # find variants that are at a high frequency bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%ID\t%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | sort -k6,6gr | head 12 121437382 A G rs1169305 0.9966 1 169519049 T C rs6025 0.9802 22 18901004 C T rs450046 0.9282 1 100672060 T C rs12021720 0.9179 12 122295335 T C rs1154510 0.8436 1 98348885 G A rs1801265 0.7709 7 150696111 T G rs1799983 0.7497 1 161599693 T C rs448740 0.6176 4 187158034 G A rs3733402 0.5385 11 18290859 C T rs1136743 0.4799 # total variants bcftools query -f '%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | wc -l 40651 # not in ExAC was 30928 # not in gnomAD bcftools query -f '%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | grep "^\.$" | wc -l 28435 # percentage not in ExAC ~76.1% # percentage not in gnomAD ~70.0% bc -l<<<28435/40651*100 69.94907874345034562400 bcftools query -f '%ID\t%AF\n' clinvar_20170104_class_5_gnomad.vcf.gz | grep -v "^\.$" > gnomad_af.txt cat gnomad_af.txt | wc -l 12216 # some variants have more than one frequency cat gnomad_af.txt | grep , | head -5 4.07e-06,8.1399e-06 0.0002294,0.00012187 1.219e-05,6.0949e-06 4.0665e-06,0 9.3694e-05,3.2589e-05,8.1473e-06 cat gnomad_af.txt | grep , | wc -l 2392 # split the AFs onto multiple lines cat gnomad_af.txt | tr ',' '\n' > temp && mv -f temp gnomad_af.txt cat gnomad_af.txt | wc -l 14896
As expected, a higher percentage of the pathogenic ClinVar variants are in the gnomAD. How many are have an AF > 0.01?
df <- read.table('gnomad_af.txt', header=FALSE, colClasses = 'numeric') names(df) <- 'gnomad_af' table(df$gnomad_af > 0.01) FALSE TRUE 14800 96
There are 3 additional pathogenic variants that have an AF > 0.01 (there were 93 using ExAC).
I had initially thought that there would be more variants with an AF > 0.01 but it doesn't look like it. How many possible exonic variants has the gnomAD captured? Using dbNSFP as an estimation, it's around 18% (15014473/83306036*100).
#dbNSFP time cat db*variant.chr* | grep -v "^#" | wc -l 83306036 real 2m6.005s user 0m24.484s sys 2m56.144s
This work is licensed under a Creative Commons
Attribution 4.0 International License.