Aligning to unique regions

Post updated on the 10th September 2013 after receiving input from the author of LAST. I’ve been interested in aligning reads to the repetitive portion of the human genome; in this post I’ll look into how well different short read alignment programs perform when aligning to unique regions of the genome. Firstly to find unique…

How deep should we sequence?

Updated 2013 November 12th. High-throughput sequencers are continually increasing their output of reads; according to Illumina, the HiSeq2500/1500 can output a maximum of 187 million single end reads per lane/flow cell. The question is “How deep should we sequence our samples?” Obviously it depends on the aim; if we wish to profile lowly expressed…

Predicting cancer

So far I’ve come across four machine learning methods: random forests, classification trees, hierarchical clustering and k-means clustering. Here I use all four of these methods (plus SVMs) to predict cancer, or more specifically malignant cancers, using the Wisconsin breast cancer dataset.

ENCODE mappability and repeats

The ENCODE mappability tracks can be visualised on the UCSC Genome Browser and they provide a sense of how mappable a region of the genome is in terms of short reads or k-mers. On a side note, it seems some people use “mapability” and some use “mappability”; I was taught the CVC rule, so I’m…

Using the ENCODE ChIA-PET dataset

Updated: 2014 March 14th. From the Wikipedia article: Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) is a technique that incorporates chromatin immunoprecipitation (ChIP)-based enrichment, chromatin proximity ligation, Paired-End Tags, and High-throughput sequencing to determine de novo long-range chromatin interactions genome-wide. Let’s get started on using the ENCODE ChIA-PET dataset by downloading the bed files,…