ExAC allele frequency of pathogenic ClinVar variants

A continuation of the post on the genomic location of pathogenic ClinVar variants. For this post I will use vcfanno to annotate the ClinVar variants with the ExAC VCF file.

To get started, download the ExAC VCF file.

# 4.1G file
wget -c ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/ExAC.r0.3.1.sites.vep.vcf.gz
wget -c ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/ExAC.r0.3.1.sites.vep.vcf.gz.tbi

Next download vcfanno according to your operating system and create a configuration file (conf.toml) that points to the ExAC VCF file along with the INFO fields you are interested in. Finally we can run vcfanno. (Instructions for creating clinvar_20170104_class_5.vcf.gz are in my previous post.)

cat conf.toml 
[[annotation]]
file="ExAC.r0.3.1.sites.vep.vcf.gz"
fields = ["AF", "AC_Het", "AC_Hom", "VQSLOD"]
ops=["first", "first", "first", "first"]

# run vcfanno on 16 cores
vcfanno -p 16 conf.toml clinvar_20170104_class_5.vcf.gz > clinvar_20170104_class_5_exac.vcf

=============================================
vcfanno version 0.1.1-alpha [built with go1.8beta1]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:114: found 4 sources from 1 files
bix.go:224: chromosome MT not found in ExAC.r0.3.1.sites.vep.vcf.gz
vcfanno.go:220: annotated 40651 variants in 25.63 seconds (1586.2 / second)

bgzip clinvar_20170104_class_5_exac.vcf
tabix -p vcf clinvar_20170104_class_5_exac.vcf.gz

# find variants that are at a high frequency
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%ID\t%AF\t%AC_Het\t%AC_Hom\n' clinvar_20170104_class_5_exac.vcf.gz | sort -k6,6gr | head
12	121437382	A	G	rs1169305	0.996	460	57820
1	169519049	T	C	rs6025	0.978	2520	58097
22	18901004	C	T	rs450046	0.915	5146	27922
1	100672060	T	C	rs12021720	0.914	9188	50877
12	122295335	T	C	rs1154510	0.85	14873	44159
1	98348885	G	A	rs1801265	0.765	20943	35865
7	150696111	T	G	rs1799983	0.751	21151	34567
1	161599693	T	C	rs448740	0.588	28152	21003
4	187158034	G	A	rs3733402	0.536	28866	18108
11	18290859	C	T	rs1136743	0.487	21903	6885

# total variants
bcftools query -f '%AF\n' clinvar_20170104_class_5_exac.vcf.gz | wc -l
40651

# not in ExAC
bcftools query -f '%AF\n' clinvar_20170104_class_5_exac.vcf.gz | grep "^\.$" | wc -l
30928

# percentage not in ExAC
bc -l<<<30928/40651*100
76.08176920616958992300

bcftools query -f '%AF\n' clinvar_20170104_class_5_exac.vcf.gz | grep -v "^\.$" > exac_af.txt
cat exac_af.txt | wc -l
9723

# some variants have more than one frequency
cat exac_af.txt | grep , | head -5
2.488e-05,2.488e-05
7.414e-05,4.943e-05
8.259e-06,1.652e-05
0.0002064,1.651e-05
0.00127,4.122e-05

cat exac_af.txt | grep , | wc -l
1214

# split the AFs onto multiple lines
cat exac_af.txt | tr ',' '\n' > temp && mv -f temp exac_af.txt
cat exac_af.txt | wc -l
11042

Plot in R.

df <- read.table('exac_af.txt', header=FALSE, colClasses = 'numeric')
names(df) <- 'exac_af'
library(ggplot2)
ggplot(data = df, aes(x = log10(exac_af))) + geom_density() + ggtitle('ExAC allele frequency of pathogenic ClinVar variants') + theme_set(theme_gray(base_size = 20))

What's the second peak?

unique_af <- unique(df$exac_af)
length(unique_af)
[1] 1506

# create a table using the unique
# allele frequencies
my_table <- table(match(df$exac_af, unique_af))

head(my_table)

  1   2   3   4   5   6 
201   8  49   6   1   1

# the 1 is the index in unique_af
# there are 201 occurrences of this AF
unique_af[1]
[1] 8.238e-06
table(df$exac_af == unique_af[1])

FALSE  TRUE 
10841   201

# store top 10 indices
top10 <- as.numeric(names(head(sort(my_table, decreasing = TRUE), 10)))
top10
 [1]   9  29  35  19  33  26  36   1 129  42

data.frame(log10_af = log10(unique_af[top10]), count = my_table[top10])
    log10_af count.Var1 count.Freq
1  -5.084284          9       2284
2  -4.783306         29       1165
3  -5.084231         35        960
4  -4.607127         19        698
5  -4.482145         33        458
6  -4.385314         26        282
7  -4.306097         36        255
8  -5.084178          1        201
9  -4.783043        129        143
10 -4.181180         42        124

data.frame(af = unique_af[top10], count = my_table[top10])
          af count.Var1 count.Freq
1  8.236e-06          9       2284
2  1.647e-05         29       1165
3  8.237e-06         35        960
4  2.471e-05         19        698
5  3.295e-05         33        458
6  4.118e-05         26        282
7  4.942e-05         36        255
8  8.238e-06          1        201
9  1.648e-05        129        143
10 6.589e-05         42        124

Summary

Around a quarter of pathogenic ClinVar variants are in ExAC. Very few of these variants have an allele frequency greater than 0.01.

table(df$exac_af > 0.01)

FALSE  TRUE 
10949    93

And vcfanno is awesome.

See also

Identification of misclassified ClinVar variants using disease population prevalence.


Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.