One of the projects I have been involved with is SeqNextGen, where I'm analysing exomes of patients who have a suspected rare genetic disorder. It's a change from what I was previously researching during my PhD; instead of working on an RNA level, I've reverse transcribed1 and I'm now examining DNA sequence and analysing genetic variants. There was a lot to learn to get started and I have written posts on "Getting started with analysing DNA sequencing data" and "Getting acquainted with analysing DNA sequencing data". I guess this is part three of the series where I'm "Getting serious with analysing DNA sequencing data2."
I have recently entered new territory and started working on analysing DNA sequencing data (as opposed to analysing RNA sequencing, i.e. transcriptomics). While it is still the analysis of lots of sequencing reads, one of the goals is to identify disease causing mutations; this is in contrast to identifying differentially expressed genes or inferring gene networks. The new playground includes an entirely new repertoire of bioinformatic tools, file formats, and genetics that I was unaware of. While it is always fun to learn about new things, time is always a constraint. In this post, I write about some of the tools and file formats I learned about, and hopefully they will be useful for those who are getting started with analysing DNA sequencing data.
I've wanted to write this post for a while, but I never had to work with paired-end libraries, so the impetus wasn't quite there. Finally I've decided to take a look at some paired-end libraries at work and as usual, I will test some simple examples before I touch the real data. For those not familiar with paired-end reads, check out this post; it has very nice and simple illustrations, along with explanations on the terminology used in paired-end sequencing. Now let's get started!
Updated 2013 November 12th.
High throughput sequencers are continually increasing their output of reads; according to Illumina, the HiSeq2500/1500 can output a maximum of 187 million single end reads per lane/flow cell. The question is "How deep should we sequence our samples?" Obviously it depends on the aim; if we wish to profile lowly expressed genes, the deeper the better. For this post I am interested in how deep I should sequence a CAGE library to reach saturation i.e. profiled all (or nearly all) RNA species in the sample. I had some samples that had a very low complexity, i.e. low number of unique reads, and I wanted to estimate the sequencing depth I would require to theoretically profile all the RNA species in my sample.
During my quest for an answer, I came across an interesting paper, which stated that for RNA sequencing:
While 100 million reads are sufficient to detect most expressed genes and transcripts, about 500 million reads are needed to measure accurately their expression levels.
I found this blog post that raised some issues about this estimate. Basically what the authors of the paper did was combine all 20 of the samples (each sample had roughly 40 million reads) together, totalling ~800 million and used this as the population. Obviously this is not the same as sequencing one sample to a depth of 800 million reads, since the samples are not homogeneous and thus the complexity is artificially higher. So in the end, this 500 million read estimation may be an over-estimation due to this inflated complexity. See the very well written blog post I linked above for more elaboration.