Aligning to unique regions

Post updated on the 10th September 2013 after receiving input from the author of LAST

I've been interested in aligning reads to the repetitive portion of the human genome; in this post I'll look into how well different short read alignment programs perform when aligning to unique regions of the genome. First, to find unique regions in the genome, I will use the CRG mappability track "wgEncodeCrgMapabilityAlign36mer.bigWig", which is available from the UCSC Genome Browser page. Have a look at my post on the ENCODE mappability tracks for more background information.

#the extremely useful stats program is available
cat wgEncodeCrgMapabilityAlign36mer.bedGraph | awk '$4 == 1 {print $3-$2}' | stats
Total lines:            15365566
Sum of lines:           2176355877
Ari. Mean:              141.638510224745
Geo. Mean:              15.0511330575994
Median:                 12
Mode:                   1 (N=1862327)
Anti-Mode:              6344 (N=1)
Minimum:                1
Maximum:                30513
Variance:               266606.806747189
StdDev:                 516.339817123558
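If you don't have the stats program installed, the first three figures reported above (line count, sum, and arithmetic mean) can be reproduced with a plain awk one-liner:

```shell
# count, sum, and arithmetic mean of the lengths of fully unique
# regions (score == 1); mirrors the top of the stats output above
awk '$4 == 1 {n++; s += $3 - $2} END {printf "n=%d sum=%d mean=%f\n", n, s, s/n}' \
  wgEncodeCrgMapabilityAlign36mer.bedGraph
```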

From the above we see that:

  1. The longest stretch of unique nucleotides is 30513 + 35 bp
  2. The average length (arithmetic mean) of all unique regions is 142 + 35 bp
  3. There were 15,365,566 unique regions of variable length

I'm going to take unique regions that are longer than a certain number of bp and generate reads from these regions.
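Filtering the bedGraph for fully unique regions above a length threshold is a one-liner; the 200 bp cut-off below is just an example value, not the threshold I'll necessarily use:

```shell
# keep regions with a score of 1 (fully unique) that are longer than
# 200 bp; the threshold is an arbitrary example value
awk '$4 == 1 && ($3 - $2) > 200' wgEncodeCrgMapabilityAlign36mer.bedGraph \
  > unique_gt200bp.bedGraph
```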

Continue reading

How deep should we sequence?

Updated 2013 November 12th.

High throughput sequencers are continually increasing their output of reads; according to Illumina, the HiSeq2500/1500 can output a maximum of 187 million single-end reads per lane/flow cell. The question is "How deep should we sequence our samples?" Obviously it depends on the aim; if we wish to profile lowly expressed genes, the deeper the better. For this post I am interested in how deep I should sequence a CAGE library to reach saturation, i.e. to have profiled all (or nearly all) RNA species in the sample. I had some samples with very low complexity, i.e. a low number of unique reads, and I wanted to estimate the sequencing depth I would require to theoretically profile all the RNA species in my sample.
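One rough, empirical way to gauge this is a rarefaction-style check: subsample the library at increasing depths and count the distinct reads at each depth; when the distinct count stops growing, the library is close to saturation. A minimal sketch (the input file reads.txt, one read/tag per line, is hypothetical):

```shell
# subsample at increasing depths and count distinct tags;
# reads.txt (one read/tag per line) is a hypothetical input file
for n in 1000000 2000000 4000000 8000000; do
  u=$(shuf -n "$n" reads.txt | sort -u | wc -l)
  printf '%s\t%s\n' "$n" "$u"
done
```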

During my quest for an answer, I came across an interesting paper, which stated that for RNA sequencing:

While 100 million reads are sufficient to detect most expressed genes and transcripts, about 500 million reads are needed to measure accurately their expression levels.

I found this blog post that raised some issues with this estimate. Basically, what the authors of the paper did was combine all 20 samples (each with roughly 40 million reads), totalling ~800 million reads, and use this as the population. Obviously this is not the same as sequencing one sample to a depth of 800 million reads, since the samples are not homogeneous and thus the complexity is artificially higher. So in the end, the 500 million read figure may be an over-estimate due to this inflated complexity. See the very well written blog post I linked above for more elaboration.

Continue reading

Predicting cancer

So far I've come across four machine learning methods: random forests, classification trees, hierarchical clustering and k-means clustering. Here I apply all four of these methods (plus SVMs) to predicting cancer, or more specifically malignant tumours, using the Wisconsin breast cancer dataset.

cat breast-cancer-wisconsin.data | head -5
cat breast-cancer-wisconsin.names | tail -25 | head -16
7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

Continue reading

ENCODE mappability and repeats

The ENCODE mappability tracks can be visualised on the UCSC Genome Browser and they provide a sense of how mappable a region of the genome is in terms of short reads or k-mers. On a side note, it seems some people use "mapability" and some use "mappability"; I was taught the CVC (consonant-vowel-consonant) rule, so I'm going to stick with "mappability".

There are two sets of tracks, alignability and uniqueness, but what's the difference between the two? From the UCSC Genome Browser:

Alignability - These tracks provide a measure of how often the sequence found at the particular location will align within the whole genome. Unlike measures of uniqueness, alignability will tolerate up to 2 mismatches. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Uniqueness - These tracks are a direct measure of sequence uniqueness throughout the reference genome. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Let's take a look at two examples, where I try to use k-mers of similar size:

#download the files
#bigWigToBedGraph can be downloaded at
#convert into flat format
bigWigToBedGraph wgEncodeDukeMapabilityUniqueness35bp.bigWig wgEncodeDukeMapabilityUniqueness35bp.bedGraph
bigWigToBedGraph wgEncodeCrgMapabilityAlign36mer.bigWig wgEncodeCrgMapabilityAlign36mer.bedGraph
head wgEncodeDukeMapabilityUniqueness35bp.bedGraph
chr1    0       10145   0
chr1    10145   10160   1
chr1    10160   10170   0.5
chr1    10170   10175   0.333333
chr1    10175   10177   0.5
chr1    10177   10215   0
chr1    10215   10224   1
chr1    10224   10229   0.5
chr1    10229   10248   1
chr1    10248   10274   0
head wgEncodeCrgMapabilityAlign36mer.bedGraph
chr1    10000   10078   0.0013624
chr1    10078   10081   0.0238095
chr1    10081   10088   0.0185185
chr1    10088   10089   0.0147059
chr1    10089   10096   0.0185185
chr1    10096   10097   0.0238095
chr1    10097   10099   0.0185185
chr1    10099   10102   0.00172712
chr1    10102   10120   0.0013624
chr1    10120   10121   0.00172712

The bedGraph format simply associates a score (the fourth column) with a particular region, defined by the first three columns. At first glance you can already see the difference in the precision of the scores. The scores are calculated the same way, as 1 divided by the number of places a 35/36-mer maps to in the genome; however, the uniqueness track only keeps scores for reads that map to at most 4 places (a minimum score of 0.25). So according to the mappability track, the region chr1:10000-10035 maps to 734 places (since 1/734 =~ 0.0013624).
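Inverting the score recovers the number of genomic hits directly, which makes the track easier to eyeball:

```shell
# convert each mappability score back to the number of places the
# k-mer maps (1/score), e.g. 0.0013624 -> 734
awk '$4 > 0 {printf "%s\t%d\t%d\t%.0f\n", $1, $2, $3, 1/$4}' \
  wgEncodeCrgMapabilityAlign36mer.bedGraph | head
```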

Continue reading

Using the ENCODE ChIA-PET dataset

Updated: 2014 March 14th

From the Wikipedia article:

Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) is a technique that incorporates chromatin immunoprecipitation (ChIP)-based enrichment, chromatin proximity ligation, Paired-End Tags, and High-throughput sequencing to determine de novo long-range chromatin interactions genome-wide.

Let's get started on using the ENCODE ChIA-PET dataset by downloading the bed files, which contain the interactions:


#get others if you want

For a preview and description of these interaction bed files, have a look at the table schema, which has the definition:

ChIA-PET Chromatin Interaction PET clusters: Two different genomic regions in the chromatin are genomically far from each other or in different chromosomes, but are spatially close to each other in the nucleus and interact with each other for regulatory functions. BED12 format is used to represent the data.
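For intra-chromosomal interactions, each BED12 line with two blocks encodes the two interacting anchor regions in its blockSizes and blockStarts columns; a minimal awk sketch (the input file name chiapet.bed12 is hypothetical) to pull them out:

```shell
# extract the two anchor regions from each two-block BED12 line
# (intra-chromosomal interactions only); chiapet.bed12 is hypothetical
awk 'BEGIN{OFS="\t"} $10 == 2 {
  split($11, sizes, ",")
  split($12, starts, ",")
  print $1, $2 + starts[1], $2 + starts[1] + sizes[1],
        $1, $2 + starts[2], $2 + starts[2] + sizes[2]
}' chiapet.bed12
```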

One cool way of visualising these chromatin interactions is with Circos (instructions for installing Circos). Here's a simple Perl script to parse the BED12 files and prepare them in a format readable by Circos:

Continue reading