Four and a half months ago, I wrote a post on Getting started with analysing DNA sequencing data. I wrote about the R packages SNPRelate and VariantAnnotation, PLINK file formats, and various tidbits; it was all over the place and because of that, it was easy to write. This post is meant to be a sequel and has been a work in progress for months because I've been trying to accumulate as much information as possible to make it more complete. It is still far from complete, but I had to draw the line somewhere. It tries to fill in some of the gaps of the first post and touches on issues of variant annotation and variant filtration.
One of the main goals of a project I'm involved in is to come up with a diagnosis pipeline, i.e. to identify causative variants that contribute mechanistically to disease. A must-read article for those working in this space is "Guidelines for investigating causality of sequence variants in human disease." In addition, check out the book Exploring Personal Genomics, which is well worth the 60 dollars. If you do not proceed any further with this post (since it is quite long), do check out those two references.
Just last night I found this educational mini game written in R and decided to have a go at it:
— Dave Tang (@davetang31) December 5, 2015
I completed it, but as I alluded to in my tweet, not in a very elegant manner. This post is on using the dplyr package in R to solve some of the problems. If you want to give the game a go first, then stop reading now.