A couple of weeks ago, I wrote a post on identifying OMIM phenotypes that are associated with a gene of interest. I thought I had solved the problem by using one of my favourite R packages (biomaRt), but alas, I had not. For example, I could not find any OMIM IDs associated with the TTN gene using biomaRt. In the end, I resorted to using the OMIM API through a small R package I wrote called romim.
It has been almost two months since my last post; I have been occupied with preparing a fellowship application (which has been sent off!) and now I'm occupied with preparing and writing papers. Sadly, I've pushed blogging right down the priority list, even though it's one of the things I enjoy doing the most. This post is on exploring the variants that were discovered as part of the UK10K project. For the uninitiated, the UK10K project was a massive undertaking that aimed to characterise human genetic variation within the UK population by using whole exome sequencing (WES) and whole genome sequencing (WGS). The WGS arm sequenced healthy individuals (n=3,781) that were part of longitudinal studies and the WES arm sequenced individuals (n=5,294, with 5,182 passing QC) with rare diseases, severe obesity, and neurodevelopmental disorders. It's not quite 10K, but it's still an impressive number for now, since the 100,000 Genomes Project has already reached 7,306 genomes:
Latest numbers from the 100,000 Genomes Project – 7,306 genomes now sequenced https://t.co/X7hXcUFQRP
— Genomics England (@GenomicsEngland) March 7, 2016
After getting started and getting acquainted with DNA sequencing data, it's finally time to explore DNA variation. A tool that makes this easy is GEMINI and this post briefly demonstrates some of its functionality. I have only used GEMINI sparingly and what I know about the tool is gathered mostly from their documentation and tutorials. Be sure to check them out if you're planning on using GEMINI.
The SAMtools mpileup utility provides a summary of the coverage of mapped reads on a reference sequence at a single base pair resolution. In addition, the output from mpileup can be piped to BCFtools to call genomic variants. I'm currently working with some Sanger sequenced PCR products, which I would like to call variants on. There are various tools for variant detection on Sanger sequences but I wanted to take this opportunity to check out SAMtools mpileup and BCFtools. In this post, I illustrate the BWA-MEM, SAMtools mpileup, and BCFtools pipeline with a bunch of randomly generated sequences.
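To make the idea of a per-base coverage summary concrete, here is a toy Python sketch of what a pileup represents. It is not SAMtools or BCFtools: the `pileup` and `naive_variants` functions, the ungapped 0-based read placements, and the `min_depth` and `min_fraction` thresholds are all assumptions made up for illustration.

```python
from collections import Counter


def pileup(reference, reads):
    """Return one (position, ref_base, depth, base_counts) tuple per reference base.

    `reads` is a list of (start, sequence) pairs. Starts are 0-based and the
    alignments are ungapped; both are simplifying assumptions for this sketch.
    Reported positions are 1-based.
    """
    columns = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1
    return [
        (pos + 1, ref_base, sum(col.values()), dict(col))
        for pos, (ref_base, col) in enumerate(zip(reference, columns))
    ]


def naive_variants(columns, min_depth=2, min_fraction=0.5):
    """Report positions where a non-reference base makes up at least
    `min_fraction` of the reads covering that position (hypothetical rule)."""
    calls = []
    for pos, ref_base, depth, counts in columns:
        if depth < min_depth:
            continue
        for base, n in counts.items():
            if base != ref_base and n / depth >= min_fraction:
                calls.append((pos, ref_base, base, n, depth))
    return calls


# Made-up reference and reads; two reads carry a T instead of A at position 5.
reference = "ACGTACGTAC"
reads = [(0, "ACGTA"), (2, "GTTCG"), (3, "TTCGT"), (5, "CGTAC")]

columns = pileup(reference, reads)
for column in columns:
    print(*column, sep="\t")
print(naive_variants(columns))
```

A real caller models base and mapping qualities, indels, and genotype likelihoods, none of which this counting rule attempts.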
I want to compare the genotype concordance between two VCF files and I came across SnpSift, which seems to calculate the statistics that I want. However, the format of the results from my run differs from the format in the documentation. In this post, I will try to come up with the exact scenarios that fall into each of the summary statistics (to satisfy my curiosity).
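As a rough picture of what a genotype concordance tally involves, here is a small Python sketch that cross-tabulates genotype classes between two VCF texts. It does not reproduce SnpSift's output or its category names: the labels (`hom_ref`, `het`, `hom_alt`, `missing_gt`, `missing_entry`), the helper functions, and the example records are all my own assumptions.

```python
from collections import Counter


def classify(gt):
    """Reduce a GT string such as '0/1' or '1|1' to a coarse genotype class."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing_gt"
    if all(allele == "0" for allele in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_alt"
    return "het"


def read_genotypes(vcf_text):
    """Map (chrom, pos, sample) -> genotype class.

    Assumes GT is the first FORMAT field and that both files list the same
    REF/ALT alleles at a shared site, since alleles are compared by index.
    """
    genotypes = {}
    samples = []
    for line in vcf_text.strip().splitlines():
        if line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        chrom, pos = fields[0], int(fields[1])
        for sample, call in zip(samples, fields[9:]):
            genotypes[(chrom, pos, sample)] = classify(call.split(":")[0])
    return genotypes


def concordance(vcf_a, vcf_b):
    """Count (class in A, class in B) pairs over every site/sample seen in either file."""
    a, b = read_genotypes(vcf_a), read_genotypes(vcf_b)
    counts = Counter()
    for key in sorted(set(a) | set(b)):
        counts[(a.get(key, "missing_entry"), b.get(key, "missing_entry"))] += 1
    return counts


HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1"


def record(pos, gt):
    """Build one made-up single-sample VCF record on chromosome 1."""
    return f"1\t{pos}\t.\tA\tG\t.\t.\t.\tGT\t{gt}"


vcf_a = "\n".join([HEADER, record(100, "0/0"), record(200, "0/1"),
                   record(300, "1/1"), record(400, "./.")])
vcf_b = "\n".join([HEADER, record(100, "0/0"), record(200, "1/1"),
                   record(300, "1/1"), record(500, "0/1")])

result = concordance(vcf_a, vcf_b)
for (class_a, class_b), n in sorted(result.items()):
    print(f"{class_a}/{class_b}\t{n}")
```

Distinguishing a site that is absent from one file (`missing_entry`) from a site that is present with an uncalled genotype (`missing_gt`) is the kind of scenario that is easy to check with hand-made records like these.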
Updated 2015 August 25th: as suggested by Tim, I checked out PLINK 1.9 and found it much simpler to convert PED into VCF. I updated the post with instructions for performing the conversion using PLINK 1.9.
Being late to the game of analysing genomic variants, I only recently discovered that IGV is capable of visualising VCF files; this is great if your variants are in VCF format. However, I have some PLINK files (a PED and MAP file), which I believe are not supported by IGV. After searching the web, I came across a Python script written by Brad Chapman that seems to be able to convert PED files into VCF files. Since the script has several dependencies, this short post simply documents how and where these are downloaded.
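To show what a PED-to-VCF conversion has to do, here is a minimal Python sketch with no dependencies. It is neither Brad Chapman's script nor PLINK: the `ped_to_vcf` function and the example data are hypothetical, and it assumes a 4-column MAP file and a PED file with the standard six leading columns. PED/MAP files do not record which allele is the reference, so the sketch simply treats the most frequent allele as REF, which is an assumption and may not match the reference genome.

```python
from collections import Counter

VCF_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def ped_to_vcf(ped_text, map_text):
    """Convert PED/MAP text to a minimal VCF string (genotypes only)."""
    markers = [line.split() for line in map_text.strip().splitlines()]
    people = [line.split() for line in ped_text.strip().splitlines()]
    sample_ids = [person[1] for person in people]  # column 2 is the individual ID
    out = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "\t".join(VCF_COLUMNS + sample_ids),
    ]
    for i, (chrom, snp_id, _genetic_distance, pos) in enumerate(markers):
        # Allele pairs start after the six leading PED columns.
        pairs = [(person[6 + 2 * i], person[7 + 2 * i]) for person in people]
        counts = Counter(a for pair in pairs for a in pair if a != "0")
        if not counts:
            continue  # no called genotypes at this marker
        # Assumption: most frequent allele becomes REF, the rest become ALT.
        alleles = [allele for allele, _ in counts.most_common()]
        index = {allele: str(n) for n, allele in enumerate(alleles)}
        genotypes = []
        for a1, a2 in pairs:
            if "0" in (a1, a2):  # "0" marks a missing allele in PED files
                genotypes.append("./.")
            else:
                genotypes.append("/".join(sorted((index[a1], index[a2]))))
        alt = ",".join(alleles[1:]) or "."
        out.append("\t".join(
            [chrom, pos, snp_id, alleles[0], alt, ".", ".", ".", "GT"] + genotypes
        ))
    return "\n".join(out) + "\n"


# Made-up example: two markers, three individuals.
map_text = """
1 rs1 0 1000
1 rs2 0 2000
"""
ped_text = """
FAM1 IND1 0 0 1 1 A A G T
FAM1 IND2 0 0 2 1 A G T T
FAM2 IND3 0 0 1 2 A G 0 0
"""

print(ped_to_vcf(ped_text, map_text), end="")
```

Because the REF choice here is a guess, a file produced this way would still need its alleles checked against the reference genome before being loaded alongside other tracks.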