Genome Analysis Toolkit

From Dave's wiki

The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyze high-throughput sequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

https://www.broadinstitute.org/gatk/

Required files

Analyses done with the GATK typically involve several (though not necessarily all) of the following inputs:

  • Reference genome sequence
  • Sequencing reads
  • Intervals of interest
  • Reference-ordered data

See more http://gatkforums.broadinstitute.org/discussion/1204/what-input-files-does-the-gatk-accept

Pipeline

gatk_pipeline.png

(Yes I realise that I have mistakenly included the GATK Local Realignment step twice.)

Variant Quality Score Recalibration

Variant Quality Score Recalibration (VQSR) is the process of assigning accurate confidence scores to each putative mutation call. The approach is to develop a continuous, covarying estimate of the relationship between SNP call annotations (e.g. QD, SB, HaplotypeScore, HRun) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artefact. In a nutshell, the tool takes the variants in the call set that overlap the training/truth resource sets, models the distribution of these variants over the specified annotations and attempts to group them into clusters. The clustering is used to assign VQSLOD scores to all variants; variants that are closer to the heart of a cluster get a higher score than variants that are outliers.

The idea is that it's better to learn what filters should be used based on the data itself. Building a model of what true genetic variation looks like allows us to rank-order variants based on their likelihood of being real. Each variant has a diverse set of statistics associated with it, called variant annotations; real variants tend to cluster together in the space of these statistics, and the clusters tend to follow a Gaussian distribution. Therefore a Gaussian mixture model can be fit to the data and new potential variants can be evaluated against this model. In short, variant annotations provide key information for identifying and removing artefacts.
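The idea of fitting a model to known-good variants and scoring new ones against it can be sketched in a few lines of Python. This is only an illustration and not GATK's implementation: it fits a single Gaussian with independent dimensions (the simplest special case of a Gaussian mixture), the (QD, FS) values are made up, and the score is a plain log density rather than a real VQSLOD.

```python
import math

def fit_gaussian(rows):
    """Fit an independent (diagonal) Gaussian to each annotation column."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    # Floor the variance so a constant column cannot cause division by zero.
    variances = [
        max(sum((r[d] - means[d]) ** 2 for r in rows) / n, 1e-9)
        for d in range(dims)
    ]
    return means, variances

def log_density(x, model):
    """Log probability density of annotation vector x under the fitted model."""
    means, variances = model
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, means, variances)
    )

# Hypothetical (QD, FS) values for call-set variants that overlap a training resource.
training = [(20.1, 1.2), (18.7, 0.8), (22.3, 2.1), (19.5, 1.5), (21.0, 0.9)]
model = fit_gaussian(training)

# Hypothetical new candidates: one near the cluster centre, one far from it.
candidates = {"typical": (20.0, 1.3), "outlier": (3.0, 45.0)}
scores = {name: log_density(x, model) for name, x in candidates.items()}

# Rank-order candidates from most to least "true-variant-like".
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The candidate close to the centre of the training cluster receives the higher score, which mirrors the rank-ordering described above.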

Plot Variant quality score/Depth against evidence for strand bias

Understanding the resources parameter in VariantRecalibrator:

  • Training - use input variants that overlap with these training sites to build the model
  • Truth - use these truth sites to determine where to set the cutoff in VQSLOD sensitivity
  • Known - only for reporting purposes and not used in any calculations
  • Prior - Phred-scaled prior estimate of how likely the sites in this resource are to be true variants
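To make the Phred scale concrete, the small sketch below converts a Phred-scaled value Q into the probability of being correct, using the standard relationship p = 1 - 10^(-Q/10). The function name is hypothetical and the Q values are arbitrary examples, not recommended priors.

```python
def phred_to_probability(q):
    """Probability of being correct implied by a Phred-scaled value Q."""
    return 1.0 - 10 ** (-q / 10.0)

# Arbitrary example values: higher Q means more trust in the resource.
for q in (10, 20, 30):
    print(q, round(phred_to_probability(q), 4))
```

A value of Q10 therefore corresponds to 90% confidence, Q20 to 99% and Q30 to 99.9%.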

See https://gatk.broadinstitute.org/hc/en-us/articles/360036727711-VariantRecalibrator

Tranches - https://gatk.broadinstitute.org/hc/en-us/articles/360040098912-FilterVariantTranches

The tranches annotation can come from the CNNScoreVariants tool (CNNLOD), VQSR (VQSLOD), or any other variant scoring tool which adds numeric annotations in a VCF's INFO field. Tranches are specified in percent sensitivity to the variants in the resource files. For example, if you specify INDEL tranches 98.0 and 99.0 using the CNN_2D score, the filtered VCF will contain 2 filter tranches for INDELs:

  • CNN_2D_INDEL_Tranche_98.00_99.00 and
  • CNN_2D_INDEL_Tranche_99.00_100.00.

Variants that scored better than the 98th percentile of variants in the resources pass through the filter and will have `PASS` in the filter field. We expect variants in the tranche CNN_2D_INDEL_Tranche_99.00_100.00 to be more sensitive, but less precise than CNN_2D_INDEL_Tranche_98.00_99.00, because variants in CNN_2D_INDEL_Tranche_99.00_100.00 have lower scores than variants in the tranche CNN_2D_INDEL_Tranche_98.00_99.00. The default tranche filtering threshold for SNPs is 99.95 and for INDELs it is 99.4. These thresholds maximise the F1 score (the harmonic mean of sensitivity and precision) for whole genome human data but may need to be tweaked for different datasets.
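The relationship between sensitivity tranches and score cutoffs can be illustrated with a minimal Python sketch. It assumes that a tranche cutoff is simply the lowest score that must be accepted to retain the given percentage of resource (truth) variants; the exact computation inside FilterVariantTranches may differ, and the function names and scores here are hypothetical. A small helper for the F1 score is included as well.

```python
import math

def sensitivity_cutoff(truth_scores, sensitivity_pct):
    """Lowest score that must be accepted to keep sensitivity_pct percent of truth sites."""
    ranked = sorted(truth_scores, reverse=True)
    keep = math.ceil(len(ranked) * sensitivity_pct / 100.0)
    return ranked[max(keep, 1) - 1]

def assign_tranche(score, truth_scores, tranches=(98.0, 99.0), prefix="CNN_2D_INDEL"):
    """Return PASS or the name of the tranche that the score falls into."""
    bounds = sorted(tranches)
    cutoffs = [sensitivity_cutoff(truth_scores, t) for t in bounds]
    if score >= cutoffs[0]:
        return "PASS"
    uppers = bounds[1:] + [100.0]
    for i, (low, high) in enumerate(zip(bounds, uppers)):
        # The last tranche extends down to the lowest possible score.
        next_cutoff = cutoffs[i + 1] if i + 1 < len(cutoffs) else -math.inf
        if score >= next_cutoff:
            return f"{prefix}_Tranche_{low:.2f}_{high:.2f}"

def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)

# Hypothetical scores for 100 truth-resource variants: 0.1, 0.2, ..., 10.0.
truth_scores = [i / 10 for i in range(1, 101)]

for score in (5.0, 0.25, 0.05):
    print(score, assign_tranche(score, truth_scores))
```

With these made-up scores the 98% cutoff is 0.3 and the 99% cutoff is 0.2, so a score of 5.0 passes, 0.25 lands in CNN_2D_INDEL_Tranche_98.00_99.00 and 0.05 lands in CNN_2D_INDEL_Tranche_99.00_100.00.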

Filter variants either with VQSR or by hard-filtering

Site-level variant filtration refers to using the INFO field annotations for filtering.
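As a rough illustration of site-level hard-filtering, the Python sketch below parses the INFO column of a VCF record and reports which thresholds a site fails. The filter names and thresholds are assumptions chosen for illustration and should be tuned for the dataset at hand; the helper functions are hypothetical and not part of the GATK.

```python
import operator

def parse_info(info_field):
    """Turn a VCF INFO string such as 'QD=1.5;FS=70.2;DB' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        # Flag entries (no '=') are stored as True.
        info[key] = value if sep else True
    return info

# (filter name, INFO key, comparison that means "fail", threshold).
# Illustrative values only.
HARD_FILTERS = [
    ("QD2", "QD", operator.lt, 2.0),
    ("FS60", "FS", operator.gt, 60.0),
    ("MQ40", "MQ", operator.lt, 40.0),
]

def site_filters(info_field):
    """Return the FILTER value for a site: PASS or a ';'-joined list of failed filters."""
    info = parse_info(info_field)
    failed = [
        name
        for name, key, fails, threshold in HARD_FILTERS
        # Sites that lack an annotation are not failed on it.
        if key in info and fails(float(info[key]), threshold)
    ]
    return ";".join(failed) if failed else "PASS"

print(site_filters("QD=1.5;FS=70.2;MQ=59.0;DB"))
print(site_filters("QD=25.0;FS=1.0;MQ=60.0"))
```

Unlike VQSR, each annotation is judged independently against a fixed cutoff, which is why hard-filtering is usually the fallback when there are too few variants or no suitable training resources.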

Further reading

Which training sets / arguments should I use for running VQSR?

How to recalibrate variant quality scores.

Useful links

How can I access the GSA public FTP server? https://www.broadinstitute.org/gatk/guide/article?id=1215

Step-by-step tutorials that demonstrate how to use the tools in practice https://www.broadinstitute.org/gatk/guide/topic?name=tutorials