Getting acquainted with analysing DNA sequencing data

Four and a half months ago, I wrote a post on Getting started with analysing DNA sequencing data. I wrote about the R packages SNPRelate and VariantAnnotation, PLINK file formats, and various tidbits; it was all over the place and because of that, it was easy to write. This post is meant to be a sequel and has been a work in progress for months because I've been trying to accumulate as much information as possible to have a more complete post. However, this post is still far from complete but I had to draw the line somewhere. It tries to fill in some of the gaps of the first post as well as touching on issues of variant annotation and variant filtration.

One of the main goals of a project I'm involved in is to come up with a diagnosis pipeline, i.e. trying to identify causative variants that contributes mechanistically to disease. A must read article for those working in this space is "Guidelines for investigating causality of sequence variants in human disease." In addition, check out the book Exploring Personal Genomics, which is well worth the 60 dollars. If you do not proceed any further with this post (since it is quite long), do check out those two references.

In my first post, I wrote about the GATK, which is a popular toolkit used for calling variants, i.e. identifying DNA sequences that differ from a reference genome. Once this has been performed, the next step is to annotate the variants with respect to various resources. For example, there are computational tools that annotate where a variant is located with respect to known gene models, predict the functional effect (if any) of a variant, and perform all sorts of overlaps to various databases. We have a lot of natural variation in our genomes and for my work, I am interested in studying the variants which may be associated with pathogenicity. Based on our variant annotation, we can define a strategy for prioritising the list of variants that warrant further and closer investigation. But before I get into all that goodness, I'll start with an update of some issues I've learned about the Variant Call Format (VCF).

The Variant Call Format

Much like SAM/BAM/CRAM files are the de facto standard for representing alignments of high-throughput sequencing reads, the VCF is the de facto standard for representing variants. However, as with any file format, there are certain caveats and here are some that pertain to VCF files that I have learned about. An article written by the author of ANNOVAR (a popular tool used for variant annotation) raised two main issues, which is that multiple variants can be represented on a single line of a VCF file and the representation of insertions and deletions (INDELs) is not consistent. ANNOVAR introduces its own file format to ensure that multi-allelic sites are represented on separate lines and INDELs are consistently represented. This also mitigates problems that arise when SNPs and INDELs are represented on the same line.

The paper "Unified representation of genetic variants" formally addresses these issues. The authors describe the concept of variant decomposition and normalisation, whereby multi-allelic sites are separated on multiple lines and INDELs are left-aligned and represented in the most parsimonious manner, respectively. They wrote the vt program to carry out the variant decomposition and normalisation steps on VCF files (among other things). As touched upon in the VCF article by the ANNOVAR author, it's important that variants are represented in a unified way for annotation purposes. Say you wanted to find out if your variant has been previously reported in the 1000 Genomes Project; if the same variant is reported in a different manner in the 1000 Genomes Project, some annotation software will report a miss. Furthermore, when comparing VCF files, some programs such as SnpSift will not compare variants unless the REF and ALT columns are identical among the two VCF files. The GEMINI software requires that VCF files have been decomposed and normalised, before they are processed.

Variant annotation

Variant annotation is exactly what the name implies; it is the annotation of variants. However, it is usually associated with the functional rather than technical annotation of variants (the latter is the context in which the GATK uses it in). Variants can be annotated with respect to different features present in a genome such as gene models, and to various "genome tracks" such as conservation or mappability tracks. If a variant lies within a coding region of a gene, it may be further annotated by its functional consequence; a variant may cause a synonymous mutation or introduce a new stop codon.

It seems that there are three main variant annotation tools that people use: ANNOVAR, SnpEff, and VEP. As far as I can tell, there is a lot of functional overlap between the three tools, i.e. they can do the same thing, however they do not always produce the same annotation. A study that compared ANNOVAR and VEP annotations found varying rates of concordance depending on the annotation category. For example, for loss of function annotations, the overall percentage match rate was only around 65%; the main reason being that ANNOVAR and VEP used different transcripts for variant annotation.

My conclusion for variant annotation is no matter which tool you end up using, make sure you know the idiosyncrasies of your tool of choice. For example, ANNOVAR will only return one annotation per variant based on a priority scheme by default, whereas VEP will return an annotation per transcript. I read through the entire ANNOVAR documentation and wrote notes. However, given that GEMINI only processes SnpEff or VEP annotated VCF files, I have decided to use VEP as my variant annotation tool of choice. (I have been using GEMINI and I will write a separate post on it sometime in the future.)

Global genetic maps

Variants are also typically annotated to various "global genetic maps", which are databases that have captured human genetic variation. One of the first projects that created a global genetic map was the International HapMap project, which had the goal of identifying haplotypes amongst different populations. One of the goals of creating HapMap was to associate haplotypes to genetic diseases that are more prevalent among specific populations. These genetic maps have also been used to (re)construct human evolutionary trees.

Two pioneering studies with exome sequences (Exome sequencing identifies the cause of a mendelian disorder and Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome) demonstrates another utility of using these global genetic maps: to discover the causal variant behind patients suffering from a suspected Mendelian disorder. Mendelian disorders are single gene disorders that are inherited in a dominant, recessive, or X-linked pattern and are highly penetrant (the variant alone is sufficient to cause disease). Therefore it is unlikely that the causative variant is present in global genetic maps created from individuals without the disease phenotype under investigation. These maps can also be used with respect to identifying variants associated with rare diseases; the idea is that if a variant is found at a relatively high frequency in the population, it is probably not the causative variant since this is inconsistent with the prevalence of the disease.

There have been other maps since HapMap, including but not limited to the 1000 Genomes Project (1KGP), the UK10K project, the NHLBI GO Exome Sequencing Project (ESP), and the Exome Aggregation Consortium (ExAC) data aggregation project. The first two projects include whole genome and exome sequencing, whereas the last two only consist of exome sequencing. Below are some statistics I gathered from some of these projects, which provide a general idea of genetic variation among humans. It should be pointed out that the UK10K, ESP, and ExAC databases contain individuals with disease phenotypes. It is my understanding that the 1KGP samples have no identifying or phenotype information available.

From the latest 1KGP paper, which was actually based on 2,504 individuals, they found that a typical genome contains around 149-182 sites with protein truncating variants, 10,000 to 12,000 sites with peptide sequence altering variants, and 459,000 to 565,000 variants lie within known regulatory regions. In addition, around 2,000 variants per genome were associated with complex traits through genome-wide association studies and 24-30 variants per genome were implicated in rare diseases through ClinVar (a database that reports the relationships among human variations and phenotypes). The title of this paper illustrates what we're up against when searching for disease causing variants amongst so much background.

The UK10K project includes two main project arms: the UK10K-cohorts (WGS of healthy individuals) arm and the UK10K-exomes arm (WES of individuals with rare disease, severe obesity, and neurodevelopmental disorders). From the 3,781 individuals who had their genomes fully sequenced, each individual had on average 3,222,597 SNVs (5,073 private), 705,684 INDELs (295 private) and 215 large deletions (less than 1 private). When the UK10K consortium analysed the variation within protein-coding genes (18,903), they found that 576 genes contained at least one homozygous or compound heterozygous variant predicted to result in the loss of function of a protein. While these healthy individuals could have undiagnosed genetic disorders, they were still deemed as healthy individuals.

The ESP was focused on the study of heart, lung, and blood disorders with the goal of identifying rare and putatively functional variants associated with these disorders. From the ESP6500 dataset (consisting of 6,515 individuals) a single nucleotide variant (SNV) was observed approximately once every 52 base pairs (bp) and 57 bp in European Americans and African Americans, respectively. Whereas in the ExAC data, which included 3,936 of the ESP individuals and had a total of 60,706 individuals, an average of one variant was found at every eight bases of coding sequence (of which 54% were variants observed only once in the data set but were of high quality). Furthermore, of the 7,404,909 high quality variants, 72% were absent from both the 1KGP and ESP.

Variant filtering

I see variant filtration as a step to remove variants that are either suspected of being technical artefacts or background mutations that are putatively benign. Given the flow of data analysis, the removal of technical artefacts is carried out first. The Variant Quality Score Recalibration step in the GATK pipeline is one way of assigning confidence to variants through the VQSLOD score that can be used as filtering metric. The method consists of training a model based on a set of known variants (typically from the 1KGP and HapMap data) and their variant annotations, and using this model to score all other variants. The GATK also includes a tool, aptly named, VariantFiltration that can be used to filter variants based on filters you define. For example:

java -jar GenomeAnalysisTK.jar \
   -T VariantFiltration \
   -R reference.fasta \
   -o output.vcf \
   --variant input.vcf \
   --clusterSize 3 \
   --clusterWindowSize 35 \
   --filterName "LowCoverage" \
   --filterExpression "DP<5" \
   --filterName "LowQual" \
   --filterExpression "QUAL>30.0&&QUAL<50.0" \
   --filterName "VeryLowQual" \
   --filterExpression "QUAL<30.0" \
   --filterName "LowConfidence" \
   --filterExpression "QD<1.5"

The VariantFiltration step will label the variant in the VCF file according to the filterName you define. The clusterSize and clusterWindowSize filter labels variants if n number of them occur in the specified window. The question I was always asking is how do we come up set of filters? It seems arbitrary. For example, in O'Rawe et al. they decided to use a depth of 10 or more and with more than four reads supporting an INDEL for defining high-confidence INDELs. And here were some filters used in Gudbjartsson et al.: SNPs were discarded if at least one of the following thresholds for their GATK call annotation parameter was violated: QD < 2.0, MQ < 40.0, FS > 60.0, HaploTypeScore > 13.0, MQRankSum < -12.5, and ReadPosRankSum < -8.0. INDELs were discarded if at least one of these inequalities were satisfied: QD < 2.0, FS > 200.0, or ReadPosRankSum < -20.0.

Reumers et al. mentioned the lack of consensus among filtering regimes in the introduction of their paper and initiated a systematic study to come up with a comprehensive set of filters based on whole genome sequencing of monozygotic twins carried out by Complete Genomics (CG). When they examined their list of SNVs that were discordant between the two genomes, they found that they were located near INDELs or a cluster of SNVs, were in loci with low or extremely high coverage, and were located in repetitive DNA regions. They found that filtering SNVs located in simple tandem repeat regions was one of the most effective filtering steps due to the specific read structure inherent with CG sequencing. This seems like a good filtering strategy to implement in general and I noticed it was also used by Mike Lin, who had his own genome sequenced and is analysing his own data. Specifically he used BEDTools to remove variants overlapping regions of the reference genome with low sequence complexity.

The other aspect of variant filtering pertains to narrowing down a large list of variants to a smaller potentially more interesting subset. The interest may be in trying to identify a list of candidate variants that are causative for the phenotype of study or in trying to identify putatively functional variants. Here's a schematic from the ANNOVAR paper that illustrates this process:


Whole Genome Sequencing versus Whole Exome Sequencing

Here are some articles that I found informative regarding the comparison of Whole Genome Sequencing (WGS) versus Whole Exome Sequencing (WES):

I think in due time, we'll just be doing WGS.


In this post, I wanted to illustrate some issues associated with the VCF, mention some tools that are used for variant annotation and filtration, and highlight the extent of human genetic variation. My next post will be on using GEMINI to explore this variation in the context of various publicly available databases.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
5 comments Add yours
  1. This link,, was referred to me by a colleague – it is focused on discerning somatic mutations from germline ones. And has also a couple of excellent references about SNP annotation and quality filtering. Yes, selection of filtering thresholds is not standardized, but this post, hopefully, will help others to select the right filters, depending on research question.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.