Sequence composition and random forests

Updated: 2013 November 28th

The sequence composition, or nucleotide composition, at the transcription start sites (TSSs) of mRNAs is biased, i.e. certain nucleotides are preferred. Here I examine the sequence composition at the TSSs of transcripts in the NCBI Reference Sequence Database, also known as RefSeq, and use random forests to see whether it's possible to train a classifier that can distinguish TSSs from random sequences. This work is done entirely in R and all the code is hosted on GitHub Gist.

Firstly, let's download the entire collection of RefSeqs using the R Bioconductor package biomaRt and create a data frame with the TSSs. Note that when working with transcripts, you should use the attributes 'transcript_start' and 'transcript_end', not 'start_position' and 'end_position'; the latter pair will give you the Ensembl gene coordinates.
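Here is a minimal sketch of how the TSS could be derived once the transcript coordinates are in a data frame. The rows and accession names below are made up for illustration; only the column names mirror the biomaRt attributes mentioned above, and I'm assuming the strand is encoded as 1 and -1. The TSS is the 5' end of the transcript, so it is the start coordinate on the plus strand and the end coordinate on the minus strand.

```r
# Toy transcript table; the rows are hypothetical, not real RefSeq entries.
# Column names follow the biomaRt attributes 'transcript_start',
# 'transcript_end' and 'strand' (assumed to be coded as 1 / -1).
transcripts <- data.frame(
  refseq_mrna      = c("NM_TOY_1", "NM_TOY_2"),
  chromosome_name  = c("1", "2"),
  transcript_start = c(1000, 5000),
  transcript_end   = c(2000, 9000),
  strand           = c(1, -1),
  stringsAsFactors = FALSE
)

# The TSS is the 5' end of the transcript:
# the start on the plus strand, the end on the minus strand
transcripts$tss <- ifelse(transcripts$strand == 1,
                          transcripts$transcript_start,
                          transcripts$transcript_end)

transcripts[, c("refseq_mrna", "strand", "tss")]
```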

Continue reading

Combinations and permutations in R

Time to get another concept under my belt: combinations and permutations. While I'm at it, I will examine combinations and permutations in R. As you may recall from school, a combination does not take order into account, whereas a permutation does. Using the example from my favourite website as of late,

  • A fruit salad is a combination of apples, bananas and grapes, since it's the same fruit salad regardless of the order of fruits
  • To open a safe you need the right order of numbers, thus the code is a permutation

As a matter of fact, a permutation is an ordered combination. There are basically two types of permutations, with repetition (or replacement) and without repetition (without replacement).
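As a quick illustration, here's a small sketch using only base R functions (choose, factorial, combn and expand.grid). The fruit vector and the choice of picking two items are just my example values.

```r
fruits <- c("apple", "banana", "grape")
n <- length(fruits)  # number of items to choose from
r <- 2               # number of items chosen

# Counting
n_comb     <- choose(n, r)                     # combinations, order ignored
n_perm     <- factorial(n) / factorial(n - r)  # permutations without repetition
n_perm_rep <- n^r                              # permutations with repetition

# Enumerating
combs      <- combn(fruits, r)                 # each column is one combination
perm_rep   <- expand.grid(first = fruits, second = fruits,
                          stringsAsFactors = FALSE)
# Dropping rows that reuse an item leaves the permutations without repetition
perm_norep <- perm_rep[perm_rep$first != perm_rep$second, ]
```

With three fruits and two picks, there are 3 combinations, 6 permutations without repetition and 9 permutations with repetition, and the enumerated objects have matching sizes.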

Continue reading

Creating a coverage plot in R

Disclaimer (2015 August 5th): as pointed out in this comment thread below, this post created a density plot rather than a coverage plot. I have written a new post that uses BEDTools to calculate the coverage and R to produce an actual coverage plot.
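To make the distinction concrete: a density plot shows a smoothed estimate of how values are distributed, whereas coverage is an exact count of how many reads overlap each base. Below is a minimal base R sketch of the latter; the read coordinates are made up (1-based, inclusive) and this is not the BEDTools approach from the new post.

```r
# Hypothetical reads given as 1-based, inclusive start and end positions
starts <- c(1, 3, 3, 8)
ends   <- c(5, 6, 4, 10)
region_length <- 10

# Coverage: for every base, count the reads that overlap it
coverage <- integer(region_length)
for (i in seq_along(starts)) {
  idx <- starts[i]:ends[i]
  coverage[idx] <- coverage[idx] + 1L
}

coverage
```

Something like plot(coverage, type = "s") would then draw the per-base counts as a step plot.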

I've recently discovered GitHub Gist, so for this post I'm going to use that to host my code (and for all subsequent posts as I see fit). The code was not displaying properly due to some CSS property of the Twenty Ten theme, so I had to update my WordPress theme to Twenty Eleven, which also led me to change my header image. The photo I used for the header was a shot I took at the summit of Mount Fuji around 06:00 on the 24th of August 2013, when the clouds finally cleared a little; I hiked all night to make it to the top to see the sunrise but unfortunately the weather was terrible. The photo looks nice, so I thought I'd use it as the header.

Anyway, back to the topic: I wanted to create a coverage plot of mapped reads starting from a BAM file. So far I've been using IGV's coverage track to get a visual idea of the coverage. In the past, I've also used bedtools genomecov to generate bedGraph files and the subsequent wig and bigWig files that I would then visualise on the UCSC Genome Browser. How about creating a coverage plot in R (so that I can export it as a PostScript file)? Yeah sure, why not. Let's download a BAM file as an example:

#the smallest CAGE BAM file from ENCODE

Continue reading

Getting started with Git

Git is a distributed version control and source code management (SCM) system with an emphasis on speed. What's version control? Version control is a system that records changes to a file or a set of files over time so that you can recall specific versions later. Here's an example: check out this tweet and the corresponding replies. It was a tweet regarding this scientist. If you read the latest version of the article there's nothing flamboyant (as stated in the tweet) about it because it has been edited since that tweet. However, if you wanted to see "the most glowing Wikipedia article written about any scientist", you can click view history on the article page and look at previous versions of the article. For example, through version control you can access this older version of the article; one that's definitely flashier than the current one.

Anyway, back to the topic: Git is a Distributed Version Control System (DVCS), which means that clients don't just check out the latest snapshot of the files; they fully mirror the repository. So what makes Git different from other version control systems? Quoting this guide:

The major difference between Git and any other Version Control Systems (VCS) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These systems think of the information they keep as a set of files and the changes made to each file over time. Git doesn't think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini file-system. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn't store the file again - just a link to the previous identical file it has already stored.
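The snapshot idea can be mimicked with a toy model in R. To be clear, this is only a sketch of the concept and not Git's actual object format; the names blobs, store_blob and commit, and the file contents, are all made up for illustration. Each commit records every file, but content that is unchanged is stored only once and referenced again.

```r
# Toy model of snapshot storage; NOT how Git is really implemented.
blobs <- character(0)  # content store: each distinct content is kept once

# Store the content if it is new; return its index in the store either way
store_blob <- function(content) {
  idx <- match(content, blobs)
  if (is.na(idx)) {
    blobs <<- c(blobs, content)
    idx <- length(blobs)
  }
  idx
}

# A commit is a snapshot: a named vector mapping every file to a blob index
commit <- function(files) {
  vapply(files, store_blob, integer(1))
}

snapshot1 <- commit(c(README = "hello", analysis.R = "x <- 1"))
snapshot2 <- commit(c(README = "hello", analysis.R = "x <- 2"))
```

Both snapshots list both files, yet the unchanged README points to the same stored content in each, so only three pieces of content are stored in total.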

Continue reading

Handling big data in R

All credit goes to this post, so be sure to check it out! I'm simply following some of the tips from that post on handling big data in R.

For this post, I will use a file that has 17,868,785 rows and 158 columns, which is quite big. Here's the size of this file:

#gzipped size
ls -sh file.gz
575M file.gz

#raw size
ls -sh file
5.7G file
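To get a feel for the scale, here's a rough sketch. The memory estimate assumes, hypothetically, that every column would be stored as an 8-byte double, which may not be true of this file. The second part shows one commonly suggested way of speeding up read.table, namely declaring colClasses and nrows up front; I'm using a small temporary stand-in file, and this isn't necessarily one of the tips from the post I'm following.

```r
# Rough in-memory size, ASSUMING all 158 columns are 8-byte doubles
n_rows  <- 17868785
n_cols  <- 158
est_gib <- n_rows * n_cols * 8 / 1024^3

# Small stand-in file so the example is self-contained
tmp <- tempfile(fileext = ".tsv")
write.table(data.frame(id = 1:100, score = runif(100), label = "a"),
            tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Learn the column classes from the first few rows only
head_rows <- read.table(tmp, header = TRUE, sep = "\t", nrows = 5,
                        stringsAsFactors = FALSE)
classes <- sapply(head_rows, class)

# Read the whole file with the classes and row count declared up front,
# so that R does not have to guess them
full <- read.table(tmp, header = TRUE, sep = "\t", colClasses = classes,
                   nrows = 100, comment.char = "")
unlink(tmp)
```

Under that assumption the estimate comes to roughly 21 GiB. One caveat with this approach: if a column looks like integers in the first few rows but contains decimals further down, the full read will fail, so the guessed classes should be checked.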

Continue reading


I remember studying calculus in school and there were so many concepts that never clicked. I could solve the equations, find derivatives, work out the area under the curve, etc. but I didn't see the use of calculus, i.e. the application of calculus. I'm revisiting calculus now because I've been taking part in a biostatistics course offered freely by Coursera, which requires a working knowledge of calculus. The definition of calculus on Wikipedia is as such:

Calculus is the mathematical study of change, in the same way that geometry is the study of shape and algebra is the study of operations and their application to solving equations.
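As a small illustration of derivatives and the area under the curve, base R can do both with D() and integrate(). The function f(x) = x^2 and the evaluation points are just my example choices.

```r
f <- function(x) x^2

# Symbolic derivative of x^2 with respect to x
df <- D(expression(x^2), "x")

# Rate of change (slope) of f at x = 3
slope_at_3 <- eval(df, list(x = 3))

# Area under the curve of f between 0 and 1, by numerical integration
area <- integrate(f, lower = 0, upper = 1)$value
```

The slope at x = 3 is 6, and the area between 0 and 1 is approximately 1/3.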

Continue reading