One of the projects I have been involved with is SeqNextGen, where I’m analysing exomes of patients who have a suspected rare genetic disorder. It’s a change from what I was researching during my PhD; instead of working at the RNA level, I’ve reverse transcribed[1] and am now examining DNA sequences and analysing genetic variants. There was a lot to learn to get started, and I have written posts on “Getting started with analysing DNA sequencing data” and “Getting acquainted with analysing DNA sequencing data”. I guess this is part three of the series, where I’m “Getting serious with analysing DNA sequencing data[2]”.
I’ll begin with one of the first ever exome studies carried out for the diagnosis of a rare disease patient: “Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease.” The patient was Nic Volker, a child with a very serious medical condition that had baffled doctors. He had symptoms of Crohn’s disease, but his illness didn’t behave like Crohn’s, and he was unresponsive to the standard Crohn’s treatments. The book “One in a Billion: The Story of Nic Volker and the Dawn of Genomic Medicine” provides a riveting chronicle of the entire case.
How did the team analysing Nic’s exome eventually identify the causative mutation? Here’s the analysis pipeline they developed:
- Novelty, based on existence of the same variant (by position and nucleotide) in the dbSNP database (dbSNP build 129).
- Depth of coverage (derived from the Roche software output).
- Quality score (derived from the Roche software output).
- Amino acid physicochemical properties thought to be important in determining protein structure (for both the reference and variant amino acids of protein-coding variants); e.g. charge, polarity, and size (standard amino acid property tables used).
- Class of change (synonymous, non-synonymous, stop codon etc.).
- Phylogenetic conservation based on UCSC PhastCons scores (providing a measure of the functional importance of the residue at that position in the protein; more highly conserved residues inferred as being more important to the function of the protein; a score of 0.9 used as highly conserved).
- Genic or genomic location (e.g. intronic, intergenic); based on comparison with the reference gene models from Entrez Gene.
- Zygosity (based on the percentage of reads that differ from the reference; 100% is defined as homozygous, between 81% and 99% inclusive as probably homozygous, between 20% and 80% inclusive as heterozygous; lower than 20% is initially categorised as a likely sequencing or assembly error).
- Effects on splice sites (the GeneSplicer tool run in-house on the reference and variant-containing DNA sequences, with the output parsed).
- PolyPhen score, prediction, and effect (the algorithm uses structural and sequence information to predict the impact of a substitution on the structure and function of a protein; run in-house).
- PDB structures for this protein or a related protein (derived from PolyPhen output).
- Online Mendelian Inheritance in Man disease association(s) for the gene containing the variant (identified using the OMIM disease to gene mapping tables from NCBI, and presented as disease names which are OMIM links).
- Protein annotation including protein ID, protein function, and description (obtained from RefSeq).
- Gene annotation including chromosomal location, gene name, unique identifiers, and gene function.
- Links to expression profiles derived from the GEO compendium (based on protein ID to expression profile mapping provided by NCBI).
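The zygosity thresholds in the pipeline above map naturally to a small classifier. Here is a minimal sketch in Python; the function name and category labels are mine, and note that the published thresholds leave small gaps (80–81% and 99–100% exclusive) that the paper does not address:

```python
def classify_zygosity(variant_reads, total_reads):
    """Classify zygosity from the percentage of reads supporting the
    variant, following the thresholds listed in the pipeline above."""
    pct = variant_reads / total_reads * 100
    if pct == 100:
        return "homozygous"
    if 81 <= pct <= 99:
        return "probably homozygous"
    if 20 <= pct <= 80:
        return "heterozygous"
    if pct < 20:
        return "likely sequencing or assembly error"
    # percentages falling in the gaps between the published thresholds
    return "unclassified"

classify_zygosity(90, 100)  # "probably homozygous"
```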
The workflow above forms the basis of many current exome analysis pipelines. I guess there are three main aspects behind the steps: 1) filtering on variant quality (coverage and quality metrics), 2) filtering on novelty (the disease variant shouldn’t be in public databases at a given frequency), and 3) filtering on functional prediction. Wang et al. 2010 describe another, similar analysis workflow.
One tool that I have been using a lot for my exome work is GEMINI; I have written a blog post on “Getting started with GEMINI”. GEMINI relies on an annotation tool (either VEP or snpEff) to carry out the functional predictions. The built-in databases in GEMINI provide information on whether your variants are in public variant or disease databases, their zygosity (based on your VCF file), and whether they overlap protein domains, functional DNA regions, etc. With VEP and GEMINI, I can implement the above workflows with a single SQL command.
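To show the shape of such a single SQL command, here is a toy re-creation using an in-memory SQLite database. The table mimics a handful of columns from GEMINI’s variants table (real GEMINI databases have many more columns, and you would run the query via the gemini command-line tool rather than sqlite3 directly); the data are made up:

```python
import sqlite3

# Toy table mimicking a few columns of GEMINI's `variants` table.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE variants
               (gene TEXT, in_dbsnp INT, impact_severity TEXT,
                depth INT, qual REAL)""")
con.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", [
    ("GENE_A", 0, "HIGH", 50, 99.0),  # novel, damaging, well covered
    ("GENE_B", 1, "HIGH", 50, 99.0),  # already in dbSNP
    ("GENE_C", 0, "LOW",  50, 99.0),  # low predicted impact
    ("GENE_D", 0, "HIGH",  5, 99.0),  # poorly covered
])

# Quality, novelty, and functional-impact filtering in one SQL statement
rows = con.execute("""SELECT gene FROM variants
                      WHERE in_dbsnp = 0
                        AND impact_severity = 'HIGH'
                        AND depth >= 20
                        AND qual >= 30""").fetchall()
```

Only `GENE_A` survives all three filters, mirroring how the three filtering aspects collapse into a single query.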
Two other complementary tools/scores that I use to assess genetic variants are the Residual Variation Intolerance Score (RVIS) and ExAC’s functional gene constraint scores. Briefly, they estimate the number of variants you would expect to see in a gene; genes that are intolerant to variation (that have fewer variants than expected) may be functionally more important than genes that are more tolerant of variation. The RVIS percentile scores are also available within GEMINI.
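The intuition behind both scores — fewer variants observed than expected implies intolerance — can be illustrated with a toy observed/expected ratio. The real RVIS is a regression residual and the ExAC constraint scores come from a statistical model, so this sketch (with made-up gene names and counts) only captures the idea, not either method:

```python
genes = {
    # gene: (observed_variants, expected_variants) -- toy numbers
    "GENE_A": (12, 40),  # far fewer than expected: likely intolerant
    "GENE_B": (38, 40),  # about as many as expected: tolerant
    "GENE_C": (5, 35),
}

def intolerance_rank(genes):
    """Rank genes by observed/expected ratio, most intolerant first."""
    ratios = {g: obs / exp for g, (obs, exp) in genes.items()}
    return sorted(ratios, key=ratios.get)

intolerance_rank(genes)
```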
I also make use of the clinical phenotypes associated with each case. The Human Phenotype Ontology (HPO) Project has standardised clinical phenotypes into an ontology and associated these terms with OMIM diseases. Tools such as Phenomizer and Phenolyzer make use of these ontologies and can associate HPO terms with genes. This way I can prioritise variants in genes that are associated with a patient’s clinical phenotypes.
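The simplest form of this prioritisation is scoring each candidate gene by how many of the patient’s HPO terms it is annotated with. In this sketch the term IDs and gene-to-term mappings are entirely illustrative; Phenomizer and Phenolyzer use much richer, ontology-aware similarity measures:

```python
# Illustrative HPO term IDs and annotations -- not real mappings.
patient_terms = {"HP:0000001", "HP:0000002", "HP:0000003"}

gene_to_terms = {
    "GENE_A": {"HP:0000001", "HP:0000002", "HP:0000009"},  # shares 2 terms
    "GENE_B": {"HP:0000003"},                              # shares 1 term
    "GENE_C": {"HP:0000008"},                              # shares none
}

def rank_by_phenotype(patient_terms, gene_to_terms):
    """Rank candidate genes by HPO terms shared with the patient."""
    scores = {g: len(patient_terms & terms)
              for g, terms in gene_to_terms.items()}
    return sorted(scores, key=scores.get, reverse=True)

ranked = rank_by_phenotype(patient_terms, gene_to_terms)
```

Variants in the top-ranked genes would then be examined first.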
By the end, I’m usually down to ~20 variants that I’d like to examine closely. I am now a regular on GeneCards and OMIM. One thing I’ve learned is that many of these rare diseases lead to distinctive facies, developmental delays, and intellectual disabilities. I’m working hard on identifying key candidate variants in as many cases as possible, though I should always remind myself that:
Have to accept the fact that I'm here to support medical decision making. Trying too hard to come up with a "diagnosis." #notarealdoctor
— Dave Tang (@davetang31) April 4, 2016
Each exome contains a lot of background variation, and the task of finding the causative mutation has been described as finding a needle in a stack of needles. Therefore, each exome workflow aims to narrow down the list of potential candidate variants using various criteria. However, the assumption that the causative variant lies in the exons may be false, in which case you would be trying to find something that doesn’t exist in your data. Hence, we are moving from exomes to whole genome sequencing, and that’s another beast.
1. Technically not true, but I was trying to make a joke.
2. I didn’t want to make that the official title of this post, though.
This work is licensed under a Creative Commons Attribution 4.0 International License.