Genome Analysis Toolkit
The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyze high-throughput sequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
Analyses done with the GATK typically involve several (though not necessarily all) of the following inputs:
- Reference genome sequence
- Sequencing reads
- Intervals of interest
- Reference-ordered data
(Yes I realise that I have mistakenly included the GATK Local Realignment step twice.)
- Introduction to Data Processing and Variant Detection for High-Throughput DNA Sequencing - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-BP-0A-Intro_to_HTS.pdf
- This presentation provides a brief overview of the concepts surrounding DNA sequencing
- Introduction to the GATK - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-BP-0B-Intro_to_GATK.pdf
- This presentation shows the best practices and explains what the GATK is
- Mapping and duplicate marking - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-BP-1-Map_and_Dedup.pdf
- Indel-based Realignment - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-BP-2-Realignment.pdf
- Base Quality Score Recalibration - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-BP-3-Base_recalibration.pdf
- Variant Calling and Genotyping - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-BP-4-Variant_calling_genotyping.pdf
- Variant Quality Score Recalibration - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-BP-5-Variant_recalibration.pdf
- Annotation and Phasing - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-BP-6-Annotation_and_phasing.pdf
- Analysing variant calls - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-BP-7-Prelim_variant_analysis.pdf
- Calling variants on RNAseq - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-X-1-Calling_RNAseq.pdf
- Prelude to variant calling: the power of cohorts - https://www.broadinstitute.org/gatk/events/slides/1409/GATKwr5-X-2-Calling%20cohorts.pdf
Variant Quality Score Recalibration
Variant Quality Score Recalibration (VQSR) is the process of assigning an accurate confidence score to each putative mutation call. The approach is to develop a continuous, covarying estimate of the relationship between SNP call annotations (e.g. QD, SB, HaplotypeScore, HRun) and the probability that a SNP is a true genetic variant rather than a sequencing or data processing artefact. In a nutshell, the tool takes the variants in the call set that overlap the training/truth resource sets, models the distribution of these variants over the specified annotations, and attempts to group them into clusters. The clustering is then used to assign a VQSLOD score to every variant; variants that are closer to the heart of a cluster get a higher score than variants that are outliers.
The idea is that it's better to learn what filters should be used from the data itself. Building a model of what true genetic variation looks like allows us to rank-order variants by their likelihood of being real. Each variant has a diverse set of statistics associated with it, called variant annotations; real variants tend to cluster together in the space of these statistics and tend to follow a Gaussian distribution. Therefore a Gaussian mixture model can be fit to the data, and new potential variants can be evaluated against this model. In short, variant annotations provide key information for identifying and removing artefacts.
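The idea of scoring variants against a model learned from the data can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the VariantRecalibrator implementation: it swaps the Gaussian mixture for a single Gaussian per annotation (with annotations treated as independent), and it assumes a second model fitted to known-bad calls so that a log-odds score can be formed. The annotation values and the names `fit_gaussian`, `log_density` and `log_odds` are hypothetical.

```python
import math
from statistics import mean, pstdev

def fit_gaussian(rows):
    """Fit one independent Gaussian (mean, std dev) per annotation column."""
    columns = list(zip(*rows))
    # Guard against a zero standard deviation in tiny toy data sets.
    return [(mean(col), max(pstdev(col), 1e-6)) for col in columns]

def log_density(model, row):
    """Log-density of one annotation vector under the fitted model."""
    total = 0.0
    for (mu, sigma), x in zip(model, row):
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total -= (x - mu) ** 2 / (2 * sigma ** 2)
    return total

def log_odds(good_model, bad_model, row):
    """Positive when the variant resembles the 'true variant' model more."""
    return log_density(good_model, row) - log_density(bad_model, row)

# Hypothetical (QD, FS) annotation values, invented for illustration.
training_true = [(20.0, 1.0), (22.0, 2.0), (18.0, 0.5), (21.0, 1.5)]
training_bad = [(2.0, 40.0), (3.0, 55.0), (1.5, 60.0), (4.0, 45.0)]

good = fit_gaussian(training_true)
bad = fit_gaussian(training_bad)

candidates = {"variant_a": (19.5, 1.2), "variant_b": (2.5, 50.0)}
scores = {name: log_odds(good, bad, row) for name, row in candidates.items()}

# Rank-order candidates from most to least likely to be real.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A variant whose annotations sit near the centre of the "true" cluster gets a high score, while an outlier gets a low one, which is the behaviour described above for VQSLOD.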
Plot Variant quality score/Depth against evidence for strand bias
Understanding the resources parameter in VariantRecalibrator:
- Training - use input variants that overlap with these training sites to build the model
- Truth - use these truth sites to determine where to set the cutoff in VQSLOD sensitivity
- Known - only for reporting purposes and not used in any calculations
- Prior - Phred-scaled estimate of data accuracy
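To make the Phred-scaled prior concrete, the short sketch below converts a prior Q into the implied probability that a site in the resource is correct, using the standard Phred relation Q = -10 log10(P_error). The function name `phred_to_probability` and the resource labels are hypothetical, and the prior values are illustrative only.

```python
def phred_to_probability(prior_q):
    """Convert a Phred-scaled prior Q into the probability of being correct."""
    error_probability = 10 ** (-prior_q / 10.0)
    return 1.0 - error_probability

# Hypothetical resources with illustrative Phred-scaled priors.
priors = {"resource_a": 15.0, "resource_b": 10.0, "resource_c": 2.0}

for name, q in priors.items():
    print(f"{name}: prior Q{q:g} -> P(correct) = {phred_to_probability(q):.4f}")
```

A higher prior therefore expresses more trust in the resource: Q10 corresponds to 90% accuracy and Q20 to 99%.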
The tranches annotation can come from the CNNScoreVariants tool (CNNLOD), VQSR (VQSLOD), or any other variant scoring tool which adds numeric annotations in a VCF's INFO field. Tranches are specified in percent sensitivity to the variants in the resource files. For example, if you specify INDEL tranches 98.0 and 99.0 using the CNN_2D score, the filtered VCF will contain two filter tranches for INDELs:
- CNN_2D_INDEL_Tranche_98.00_99.00 and
- CNN_2D_INDEL_Tranche_99.00_100.00
Variants that scored better than the 98th percentile of variants in the resources pass through the filter and will have `PASS` in the filter field. We expect variants in the tranche CNN_2D_INDEL_Tranche_99.00_100.00 to be more sensitive, but less precise than CNN_2D_INDEL_Tranche_98.00_99.00, because variants in CNN_2D_INDEL_Tranche_99.00_100.00 have lower scores than variants in the tranche CNN_2D_INDEL_Tranche_98.00_99.00. The default tranche filtering threshold for SNPs is 99.95 and for INDELs it is 99.4. These thresholds maximise the F1 score (the harmonic mean of sensitivity and precision) for whole genome human data but may need to be tweaked for different datasets.
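The relationship between sensitivity tranches and score cutoffs can be sketched as follows. This is an illustration of the concept under simple assumptions, not the filtering tool's actual code: the cutoff for a tranche is taken to be the score at which the requested percentage of resource (truth) variants is retained, and the scores and the helper names `tranche_cutoffs` and `tranche_label` are hypothetical.

```python
import math

def tranche_cutoffs(truth_scores, sensitivities):
    """For each sensitivity (percent), find the lowest score that still
    retains that percentage of the truth variants."""
    ordered = sorted(truth_scores, reverse=True)
    cutoffs = {}
    for sensitivity in sensitivities:
        keep = max(math.ceil(len(ordered) * sensitivity / 100.0), 1)
        cutoffs[sensitivity] = ordered[keep - 1]
    return cutoffs

def tranche_label(score, cutoffs, prefix="CNN_2D_INDEL_Tranche"):
    """Return PASS or the name of the tranche a score falls into."""
    levels = sorted(cutoffs)
    if score >= cutoffs[levels[0]]:
        return "PASS"
    for low, high in zip(levels, levels[1:]):
        if score >= cutoffs[high]:
            return f"{prefix}_{low:.2f}_{high:.2f}"
    return f"{prefix}_{levels[-1]:.2f}_100.00"

# Hypothetical scores for 100 truth-site variants: 0.1, 0.2, ..., 10.0
truth_scores = [i / 10 for i in range(1, 101)]
cutoffs = tranche_cutoffs(truth_scores, [98.0, 99.0])

for score in (5.0, 0.25, 0.05):
    print(score, tranche_label(score, cutoffs))
```

Lowering the cutoff admits more of the truth variants (higher sensitivity) but also admits lower-scoring calls, which is why the higher-numbered tranche is expected to be less precise.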
Filter variants either with VQSR or by hard-filtering
Site-level variant filtration refers to using the INFO field annotations for filtering.
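As a rough sketch of what filtering on INFO field annotations involves, the Python below parses the semicolon-separated INFO column of a VCF record and applies threshold rules to it. The INFO strings, the filter names and the thresholds are illustrative assumptions rather than recommendations, and the helpers `parse_info` and `failed_filters` are hypothetical; in practice a dedicated tool would normally do this.

```python
def parse_info(info_field):
    """Parse a VCF INFO string such as 'DP=100;QD=1.5;DB' into a dict.
    Flag entries without a value (e.g. 'DB') are stored as True."""
    info = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, separator, value = entry.partition("=")
        info[key] = value if separator else True
    return info

def failed_filters(info, rules):
    """Return the names of all rules whose condition the site meets.
    Assumes the annotations used in the rules hold a single number."""
    failed = []
    for name, (key, is_bad) in rules.items():
        if key in info and is_bad(float(info[key])):
            failed.append(name)
    return failed

# Illustrative rules: filter name -> (annotation, failing condition).
rules = {
    "LowQD": ("QD", lambda value: value < 2.0),
    "HighFS": ("FS", lambda value: value > 60.0),
}

bad_site = parse_info("DP=100;QD=1.5;FS=70.2;DB")
good_site = parse_info("DP=85;QD=18.3;FS=1.1")

print(failed_filters(bad_site, rules) or "PASS")
print(failed_filters(good_site, rules) or "PASS")
```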
How can I access the GSA public FTP server? https://www.broadinstitute.org/gatk/guide/article?id=1215
Step-by-step tutorials that demonstrate how to use the tools in practice https://www.broadinstitute.org/gatk/guide/topic?name=tutorials