

time zcat ExAC.r0.3.sites.vep.vcf.gz | grep -v "^#" | wc -l

real    3m21.773s
user    1m40.732s
sys     0m21.904s
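
The same count can be obtained by letting grep do the counting, which drops the separate `wc -l` step. The sketch below is only an illustration on a tiny made-up VCF (not the ExAC file); it uses `gzip -dc`, which behaves like `zcat` on GNU systems, and the variable names are hypothetical.

```shell
# Build a tiny gzipped VCF in a temporary directory (toy data, not ExAC).
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tG\n1\t200\t.\tC\tT,G\n' |
  gzip > "$tmpdir/toy.vcf.gz"

# grep -c counts lines; -v inverts the match, so header lines starting with # are skipped.
n_records=$(gzip -dc "$tmpdir/toy.vcf.gz" | grep -vc '^#')
echo "$n_records"

rm -r "$tmpdir"
```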

Annotating the dataset


convert2annovar.pl -includeinfo -allsample -withfreq -format vcf4 ExAC.r0.3.sites.vep.vcf.gz > exac.avinput
annotate_variation.pl -geneanno -buildver hg19 exac.avinput humandb/

zcat exac.avinput.variant_function.gz | cut -f1 | sort | uniq -c | sort -k1rn
5826068 exonic
3405324 intronic
 239339 UTR3
 178510 UTR5
 159181 intergenic
 143390 ncRNA_intronic
 125061 ncRNA_exonic
  62298 splicing
  27842 upstream
  15603 downstream
   3306 upstream;downstream
   1508 exonic;splicing
    940 ncRNA_splicing
    541 UTR5;UTR3
    269 ncRNA_exonic;splicing
      2 ncRNA_UTR5

zcat exac.avinput.exonic_variant_function.gz | cut -f2 | sort | uniq -c | sort -k1rn
3599443 nonsynonymous SNV
1827439 synonymous SNV
 106267 stopgain
 104504 frameshift deletion
  59632 nonframeshift deletion
  55507 unknown
  48914 frameshift insertion
  21682 nonframeshift insertion
   4187 stoploss
      1 frameshift substitution
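
The two tallies above can be extended with percentages. The following is a minimal sketch rather than part of the original analysis: `tally_pct` is a hypothetical helper name, and the input is toy data standing in for the output of `cut -f1` or `cut -f2`.

```shell
# Hypothetical helper: tally the values on stdin and print
# count, percentage of the total, and the value (tab-separated).
tally_pct() {
  sort | uniq -c | sort -k1,1rn |
    awk '{
           n[NR] = $1                 # count produced by uniq -c
           $1 = ""; sub(/^ +/, "")    # drop the count, keep the label (may contain spaces)
           label[NR] = $0
           total += n[NR]
         }
         END {
           for (i = 1; i <= NR; i++)
             printf "%d\t%.2f%%\t%s\n", n[i], 100 * n[i] / total, label[i]
         }'
}

# Toy input standing in for: zcat exac.avinput.variant_function.gz | cut -f1
printf 'exonic\nexonic\nexonic\nintronic\nintronic\nUTR3\n' | tally_pct
```

On the real files the helper would be used in place of the `sort | uniq -c | sort -k1rn` part of each pipeline.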

Analysis of protein-coding genetic variation in 60,706 humans


Some definitions to understand the rest of the paper:

  • Mutational recurrence refers to instances in which the same mutation has occurred multiple times independently throughout the history of the sequenced populations
  • The singleton rate is the proportion of variants seen only once in ExAC
  • Independent mutational events were inferred when a variant was observed in two separate populations
  • Multinucleotide polymorphisms (MNPs) are clusters of base substitutions on the same haplotype
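
To make the singleton rate definition concrete, here is a rough sketch that computes it from VCF-style records. It assumes the allele count is stored as `AC=` in the INFO column (column 8), with one comma-separated value per alternate allele, and it treats each alternate allele as one variant. The function name `singleton_rate` and the toy records are hypothetical; the paper's own calculation may use a different count field or filtering.

```shell
# Hypothetical helper: proportion of alternate alleles with an allele count of 1.
singleton_rate() {
  awk -F'\t' '
    /^#/ { next }                              # skip header lines
    {
      n = split($8, info, ";")                 # INFO column, key=value pairs
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^AC=/) {                # exact AC key; AC_Adj= does not match
          m = split(substr(info[i], 4), ac, ",")
          for (j = 1; j <= m; j++) {
            total++
            if (ac[j] == 1) singletons++
          }
        }
      }
    }
    END { if (total > 0) printf "%.3f\n", singletons / total }'
}

# Toy records (not ExAC data): allele counts 1, 5, 1 and 12.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t50\tPASS\tAC=1;AC_Adj=1;AN=100\n1\t200\t.\tC\tT,G\t50\tPASS\tAC=5,1;AN=100\n1\t300\t.\tG\tA\t50\tPASS\tAC=12;AN=100\n' |
  singleton_rate
```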

From the abstract:

  • Large-scale reference data sets of human genetic variation are useful for the functional interpretation of sequence variants
  • The ExAC catalog contains an average of one variant every eight bases of coding sequence and shows widespread mutational recurrence
  • Identified 3,230 genes subject to strong selection against various classes of mutation, 79% of which have no currently established human disease phenotype

From the background:

  • Whole genome sequencing (WGS) and whole exome sequencing (WES) provide a powerful source of information on the global patterns of human genetic variation and also provide critical resources for the clinical interpretation of variants observed in patients suffering from rare Mendelian diseases
  • The ExAC call set exceeds previously available exome-wide variant databases and provides unprecedented resolution for the analysis of very low-frequency genetic variants
  • The analysis of patterns of genetic variation led to the discovery of widespread mutational recurrence, the inference of gene-level constraint against truncating variation, the clinical interpretation of variation in Mendelian disease genes, and the discovery of human "knockout" variants in protein-coding genes

Variant discovery and quality control

  • They assembled 1 petabyte of raw sequencing data from 91,796 individual exomes drawn from a wide range of primarily disease-focused consortia
  • Used a new version of the GATK HaplotypeCaller pipeline
  • At each site, sequence information from all individuals was used to assess the evidence for the presence of a variant in each individual
  • >10,000 samples had been directly genotyped using a SNP array (Illumina HumanExome), and heterozygous concordance with the array genotypes was 97-99%
  • To identify the ancestry of each ExAC sample, they performed a PCA on 5,400 common SNVs that had a high coverage across all of the exome capture technologies; the PCA identified population clusters corresponding to individuals of European, African, South Asian, East Asian, and admixed American (Latino) ancestry
  • The density of variation in ExAC is not uniform across the exome

Patterns of protein-coding variation revealed by large samples

  • 7.9% of high quality sites in ExAC are multiallelic
  • Among synonymous variants, a class of variation expected to have undergone minimal selection, 43% of validated de novo events identified in external datasets of 1,756 parent-offspring trios are also observed independently in the ExAC dataset, indicating a separate origin for the same variant within the demographic history of the two samples
  • Sites with low predicted mutability have a higher singleton rate (60%) than sites with high predicted mutability (20%), i.e. highly mutable sites are more likely to have mutated more than once and so are less often seen as singletons
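
The multiallelic fraction mentioned above can be approximated directly from a sites VCF. This is a sketch under two assumptions that are not from the paper: "high quality" is taken to mean `PASS` in the FILTER column (column 7), and a site counts as multiallelic when its ALT column (column 5) lists more than one allele. The function name `multiallelic_pct` and the records are hypothetical toy data.

```shell
# Hypothetical helper: percentage of PASS sites whose ALT column lists several alleles.
multiallelic_pct() {
  awk -F'\t' '
    /^#/ || $7 != "PASS" { next }              # skip headers and filtered sites
    { sites++; if ($5 ~ /,/) multi++ }         # a comma in ALT means more than one allele
    END { if (sites > 0) printf "%.1f\n", 100 * multi / sites }'
}

# Toy records (not ExAC data): four PASS sites, one of them multiallelic,
# plus one filtered multiallelic site that is ignored.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t50\tPASS\t.\n1\t200\t.\tC\tT,G\t50\tPASS\t.\n1\t300\t.\tG\tA\t50\tPASS\t.\n1\t400\t.\tT\tC\t50\tPASS\t.\n1\t500\t.\tA\tC,G\t10\tLowQual\t.\n' |
  multiallelic_pct
```

On the real data the function would read from `zcat ExAC.r0.3.sites.vep.vcf.gz`; the result need not match the published 7.9% exactly, since the paper's quality criteria may differ from a simple `PASS` filter.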

Inferring variant deleteriousness and gene constraint

  • Deleterious variants are expected to have lower allele frequencies than neutral ones, due to negative selection