ExAC
Dataset
time zcat ExAC.r0.3.sites.vep.vcf.gz | grep -v "^#" | wc -l
9362318

real 3m21.773s
user 1m40.732s
sys 0m21.904s
Annotating the dataset
Using ANNOVAR
convert2annovar.pl -includeinfo -allsample -withfreq -format vcf4 ExAC.r0.3.sites.vep.vcf.gz > exac.avinput
annotate_variation.pl -geneanno -buildver hg19 exac.avinput humandb/

zcat exac.avinput.variant_function.gz | cut -f1 | sort | uniq -c | sort -k1rn
5826068 exonic
3405324 intronic
239339 UTR3
178510 UTR5
159181 intergenic
143390 ncRNA_intronic
125061 ncRNA_exonic
62298 splicing
27842 upstream
15603 downstream
3306 upstream;downstream
1508 exonic;splicing
940 ncRNA_splicing
541 UTR5;UTR3
269 ncRNA_exonic;splicing
2 ncRNA_UTR5

zcat exac.avinput.exonic_variant_function.gz | cut -f2 | sort | uniq -c | sort -k1rn
3599443 nonsynonymous SNV
1827439 synonymous SNV
106267 stopgain
104504 frameshift deletion
59632 nonframeshift deletion
55507 unknown
48914 frameshift insertion
21682 nonframeshift insertion
4187 stoploss
1 frameshift substitution
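The raw tallies are easier to compare as proportions. The following is a minimal sketch, not part of the original analysis: `to_percent` is a hypothetical helper name, and it assumes its input is in the "count label" layout that `uniq -c` prints.

```shell
# to_percent (hypothetical helper): read "count label" lines, as printed by
# `uniq -c`, on stdin and print each label with its share of the total.
to_percent() {
  awk '{
         count[NR] = $1
         $1 = ""              # drop the count, keep the (possibly multi-word) label
         sub(/^ +/, "")       # trim the separator left behind by the dropped field
         label[NR] = $0
         total += count[NR]
       }
       END {
         for (i = 1; i <= NR; i++)
           printf "%.2f%%\t%s\n", 100 * count[i] / total, label[i]
       }'
}

# Example: the three largest exonic categories from the output above.
printf '%s\n' \
  '3599443 nonsynonymous SNV' \
  '1827439 synonymous SNV' \
  '106267 stopgain' | to_percent
```

The same function can be appended to either pipeline above in place of the final `sort -k1rn`, since the percentages are printed in input order.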
Analysis of protein-coding genetic variation in 60,706 humans
http://biorxiv.org/content/early/2015/10/30/030338
Some definitions to understand the rest of the paper:
- Mutational recurrence refers to instances in which the same mutation has occurred multiple times independently throughout the history of the sequenced populations.
- The singleton rate is the proportion of variants seen only once in ExAC.
- Independent mutational events were inferred when the same variant was observed in two separate populations.
- Multinucleotide polymorphisms (MNPs) are clusters of base substitutions on the same haplotype
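The singleton rate defined above can be sketched directly from a sites VCF. This is a rough illustration only: it assumes the INFO column (column 8) carries the standard `AC` allele-count tag, it treats each alternate allele as one variant, and it applies no quality filtering, so it will not reproduce the paper's figures. `singleton_rate` is a hypothetical helper name.

```shell
# singleton_rate (hypothetical helper): read an uncompressed VCF on stdin and
# report the fraction of observed alternate alleles with an allele count of 1.
singleton_rate() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    {
      n = split($8, info, ";")                     # INFO is column 8
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^AC=/) {                    # exact AC tag, not AC_* variants
          m = split(substr(info[i], 4), ac, ",")   # one count per ALT allele
          for (j = 1; j <= m; j++) {
            if (ac[j] > 0) total++                 # allele actually observed
            if (ac[j] == 1) singletons++           # observed exactly once
          }
        }
      }
    }
    END {
      if (total > 0)
        printf "%d/%d = %.3f\n", singletons, total, singletons / total
    }'
}

# Usage, assuming the file from the Dataset section:
# zcat ExAC.r0.3.sites.vep.vcf.gz | singleton_rate
```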
From the abstract:
- Large-scale reference data sets of human genetic variation are useful for the functional interpretation of sequence variants
- The ExAC catalog has an average of one variant every eight bases of coding sequence and shows widespread mutational recurrence
- Identified 3,230 genes subject to strong selection against various classes of mutation, 79% of which have no currently established human disease phenotype
From the background:
- Whole genome sequencing (WGS) and whole exome sequencing (WES) provide a powerful source of information on the global patterns of human genetic variation and also provide critical resources for the clinical interpretation of variants observed in patients suffering from rare Mendelian diseases
- The ExAC call set exceeds previously available exome-wide variant databases and provides unprecedented resolution for the analysis of very low-frequency genetic variants
- The analysis of patterns of genetic variation led to the discovery of widespread mutational recurrence, the inference of gene-level constraint against truncating variation, the clinical interpretation of variation in Mendelian disease genes, and the discovery of human "knockout" variants in protein-coding genes
Variant discovery and quality control
- They assembled 1 petabyte of raw sequencing data from 91,796 individual exomes drawn from a wide range of primarily disease-focused consortia
- Used a new version of the GATK HaplotypeCaller pipeline
- At each site, sequence information from all individuals was used to assess the evidence for the presence of a variant in each individual
- More than 10,000 samples had also been directly genotyped using a SNP array (Illumina HumanExome); heterozygous concordance between the array genotypes and the exome calls was 97-99%
- To identify the ancestry of each ExAC sample, they performed a PCA on 5,400 common SNVs that had a high coverage across all of the exome capture technologies; the PCA identified population clusters corresponding to individuals of European, African, South Asian, East Asian, and admixed American (Latino) ancestry
- The density of variation in ExAC is not uniform across the exome
Patterns of protein-coding variation revealed by large samples
- 7.9% of high quality sites in ExAC are multiallelic
- Among synonymous variants, a class of variation expected to have undergone minimal selection, 43% of validated de novo events identified in external datasets of 1,756 parent-offspring trios are also observed independently in the ExAC dataset, indicating a separate origin for the same variant within the demographic history of the two samples
- Sites with low predicted mutability have a higher singleton rate (60%) than sites with high predicted mutability (20%), i.e. highly mutable sites are more likely to have mutated more than once, so their variants are less often singletons
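The proportion of multiallelic sites can be approximated from the sites VCF by counting records whose ALT column lists more than one allele. This is a sketch under stated assumptions, not the authors' method: it treats `FILTER == PASS` as a stand-in for "high quality", and it assumes multiallelic sites are stored on a single line with comma-separated ALT alleles. `multiallelic_fraction` is a hypothetical helper name.

```shell
# multiallelic_fraction (hypothetical helper): read an uncompressed VCF on
# stdin and report the fraction of PASS sites with more than one ALT allele.
multiallelic_fraction() {
  awk -F'\t' '
    /^#/ { next }              # skip header lines
    $7 != "PASS" { next }      # assumption: PASS approximates "high quality"
    {
      sites++
      if ($5 ~ /,/) multi++    # comma in ALT (column 5) means several alleles
    }
    END {
      if (sites > 0)
        printf "%d/%d = %.3f\n", multi, sites, multi / sites
    }'
}

# Usage, assuming the file from the Dataset section:
# zcat ExAC.r0.3.sites.vep.vcf.gz | multiallelic_fraction
```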
Inferring variant deleteriousness and gene constraint
- Deleterious variants are expected to have lower allele frequencies than neutral ones, due to negative selection