Gene to OMIM Morbid Map

Update (10 May 2017): I realised that this approach unfortunately doesn't work for all genes. For example, the gene TTN (an HGNC approved gene symbol) is associated with OMIM IDs 600334, 603689, 604145, 608807, 611705, and 613765, but biomaRt returns NA. Please refer to the updated post.

I was interested in how many Online Mendelian Inheritance in Man (OMIM) disorders a particular gene, in this case FGFR2, is associated with. Once again it was biomaRt to the rescue. OMIM is a collection of genes and disorders, and the morbid map refers to the disorders. This post is on looking up the OMIM morbid IDs for FGFR2.

A single exome

In the age of 50,000+ and 60,000+ whole exome catalogues, it's hard to find processed data for a single exome. At least I had trouble trying to find a single VCF file for a single exome from one individual. After searching for a while, I gave up and decided to generate one myself. This post is on how I generated a single VCF file, which I have hosted on my web server.

ExAC allele frequency of pathogenic ClinVar variants

This is a continuation of the post on the genomic location of pathogenic ClinVar variants. For this post I will use vcfanno to annotate the ClinVar variants with allele frequencies from the ExAC VCF file.

To get started, download the ExAC VCF file.

# 4.1G file
wget -c ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/ExAC.r0.3.1.sites.vep.vcf.gz
wget -c ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/ExAC.r0.3.1.sites.vep.vcf.gz.tbi
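
With the file and its index downloaded, the annotation step could be sketched roughly as below. This is only an illustration, not necessarily the setup used in the full post: it assumes we want ExAC's `AF` INFO field, the output name `exac_af`, the config name `exac.toml`, and the output file name are all made up for this sketch, and the ClinVar file name is assumed to be the one from the previous post. The vcfanno call is skipped if the tool or either input is missing.

```shell
# Write a minimal vcfanno config.
# Assumption: only ExAC's AF INFO field is wanted, copied as-is ("self")
# into a new INFO field with the hypothetical name exac_af.
cat > exac.toml <<'EOF'
[[annotation]]
file = "ExAC.r0.3.1.sites.vep.vcf.gz"
fields = ["AF"]
names = ["exac_af"]
ops = ["self"]
EOF

# Assumption: the query is the ClinVar VCF from the previous post.
query=clinvar_20170104.vcf.gz

# Only run the annotation when vcfanno and both inputs are available.
if command -v vcfanno >/dev/null 2>&1 && [ -f "$query" ] && [ -f ExAC.r0.3.1.sites.vep.vcf.gz ]; then
  vcfanno exac.toml "$query" > clinvar_exac.vcf
fi
```

Keeping the config in its own file makes it easy to add further fields later without changing the command line.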

Genomic location of pathogenic ClinVar variants

How many pathogenic ClinVar variants are in intergenic regions? I'll define genomic regions as per this old post. To get started, download the latest ClinVar variants:

wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20170104.vcf.gz

# index
tabix -p vcf clinvar_20170104.vcf.gz

# how many variants?
zcat clinvar_20170104.vcf.gz | grep -v "^#" | wc -l
232624
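
The counting step can be wrapped in a small helper and tried on a toy file before running it on the full download. This is a minimal sketch: `count_vcf_records` and the toy VCF are hypothetical and the records are made-up data. It uses `gzip -dc` rather than `zcat`, since on some systems (macOS, for example) `zcat` may not accept `.gz` files directly.

```shell
# Hypothetical helper: count the non-header records in a gzipped VCF.
count_vcf_records() {
  gzip -dc "$1" | grep -vc "^#"
}

# Toy VCF with two header lines and three records (made-up data).
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t.\t.\t.\n1\t200\t.\tC\tT\t.\t.\t.\n2\t300\t.\tG\tA\t.\t.\t.\n' | gzip -c > "$tmpdir/toy.vcf.gz"

# Prints the number of records in the toy file.
count_vcf_records "$tmpdir/toy.vcf.gz"
```

The same helper can then be pointed at `clinvar_20170104.vcf.gz`.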

Assessing genetic variants

One of the projects I have been involved with is SeqNextGen, where I'm analysing exomes of patients who have a suspected rare genetic disorder. It's a change from what I was researching during my PhD; instead of working at the RNA level, I've reverse transcribed and I'm now examining DNA sequence and analysing genetic variants. There was a lot to learn to get started and I have written posts on "Getting started with analysing DNA sequencing data" and "Getting acquainted with analysing DNA sequencing data". I guess this is part three of the series, where I'm "Getting serious with analysing DNA sequencing data."

Exploring the UK10K variants

It has been almost two months since my last post; I have been occupied with preparing a fellowship application (which has been sent off!) and now I'm occupied with preparing and writing papers. Sadly, I've pushed blogging right down the priority list, even though it's one of the things I enjoy doing the most. This post is on exploring the variants that were discovered as part of the UK10K project. For the uninitiated, the UK10K project was a massive undertaking that aimed to characterise human genetic variation within the UK population using whole exome sequencing (WES) and whole genome sequencing (WGS). The WGS arm sequenced healthy individuals (n=3,781) who were part of longitudinal studies, and the WES arm sequenced individuals (n=5,294, with 5,182 passing QC) with rare diseases, severe obesity, and neurodevelopmental disorders. It's not quite 10K, but it's still an impressive number for now, since the 100,000 Genomes Project has already reached 7,306 genomes.