Updated: 2013 November 28th
The nucleotide composition at transcription start sites (TSSs) of mRNAs is biased, i.e. certain nucleotides are preferred. Here I examine the sequence composition at the TSSs of transcripts in the NCBI Reference Sequence Database (RefSeq) and use random forests to see whether a classifier can be trained to distinguish TSSs from random sequences. This work is done entirely in R, and all the code is hosted on GitHub Gist.
Firstly, let's download the entire collection of RefSeqs using the R Bioconductor package biomaRt and create a data frame with the TSSs. Note that when working with transcripts, use the attributes 'transcript_start' and 'transcript_end' rather than 'start_position' and 'end_position'; the latter pair gives you the Ensembl gene coordinates, not the transcript coordinates.
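As a rough sketch of this step (not necessarily the exact code in the Gist), the TSS can be derived from the transcript coordinates and the strand: on the plus strand the TSS is 'transcript_start', and on the minus strand it is 'transcript_end'. The function names `fetch_refseq_transcripts` and `add_tss` are hypothetical, and the human dataset name and the 'with_refseq_mrna' filter are assumptions that should be checked against `listDatasets()` and `listFilters()` for your Ensembl release. The toy data frame is made up purely to illustrate the strand logic.

```r
# Hypothetical helper: fetch RefSeq mRNA transcript coordinates from Ensembl.
# Needs the Bioconductor package biomaRt and a network connection, so it is
# only defined here, not called.
# Assumptions: the dataset name and the 'with_refseq_mrna' filter.
fetch_refseq_transcripts <- function(dataset = "hsapiens_gene_ensembl") {
  ensembl <- biomaRt::useMart("ensembl", dataset = dataset)
  biomaRt::getBM(
    attributes = c("refseq_mrna", "chromosome_name",
                   "transcript_start", "transcript_end", "strand"),
    filters = "with_refseq_mrna",
    values = TRUE,
    mart = ensembl
  )
}

# Hypothetical helper: add a 'tss' column.
# Ensembl encodes strand as 1 (plus) or -1 (minus).
add_tss <- function(transcripts) {
  transcripts$tss <- ifelse(transcripts$strand == 1,
                            transcripts$transcript_start,
                            transcripts$transcript_end)
  transcripts
}

# Made-up example rows (not real RefSeq entries) to show the strand logic.
toy <- data.frame(
  refseq_mrna      = c("NM_TOY_PLUS", "NM_TOY_MINUS"),
  chromosome_name  = c("1", "1"),
  transcript_start = c(100, 300),
  transcript_end   = c(200, 400),
  strand           = c(1, -1),
  stringsAsFactors = FALSE
)
toy_tss <- add_tss(toy)
print(toy_tss[, c("refseq_mrna", "strand", "tss")])
```

With real data, the same idea would be `refseq_tss <- add_tss(fetch_refseq_transcripts())`.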