R data visualisation

Once upon a time, I made my graphs in Excel because it was the only graphing software I knew of. Excel can certainly produce fairly good-looking graphs these days, but after seeing some examples of R graphics, I wanted to learn more about data visualisation in R. This post covers some plotting functions I recently discovered and have found useful.

Using identify()

This lecture on R data visualisation provides a very nice introduction to producing plots in R, which I highly recommend. One really cool trick I learned from the course was the use of identify(), which lets you identify points in a scatter plot. Imagine you were comparing the expression of genes from two different libraries and wanted to know which gene a particular dot represented. Here's how:
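Here is a minimal sketch of the idea using simulated expression values (the data, seed and gene labels below are hypothetical, not from the lecture):

```r
# simulate expression values for 100 genes in two hypothetical libraries
set.seed(31)
lib1 <- rnorm(100, mean = 8, sd = 2)
lib2 <- lib1 + rnorm(100, sd = 1)
gene <- paste0("gene_", seq_along(lib1))

# scatter plot comparing the two libraries
plot(lib1, lib2, xlab = "Library 1", ylab = "Library 2")

# in an interactive session, click on points of interest and then
# right-click (or press Esc) to finish; identify() labels the clicked
# points and returns their row indices
if (interactive()) {
  identify(lib1, lib2, labels = gene)
}
```

Since identify() needs an interactive graphics device, it is wrapped in if (interactive()) so the script also runs non-interactively.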

Continue reading

Mapping repeats

Most eukaryotic genomes are interspersed with repetitive elements, and some of these elements are transcriptionally active, so they show up when we sequence the RNA population. Judging by current trends, some of these elements seem to be important. One strategy for analysing these repeats is to map them back to the genome, to see where they came from and which repeat class they arose from. This post looks into mapping repeats and how sequencing accuracy can affect the mapping accuracy.

I will use the human genome as an example; according to RepeatMasker and Repbase, there are 5,298,130 repetitive elements in the human genome. How much of the genome is that? First, download the RepeatMasker results for hg19 from the UCSC Table Browser. I downloaded the results as a BED file and named it hg19_rmsk.bed.

#the extremely useful stats program is available
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a=split; $s=$a[2]-$a[1]; print $s' | stats
Total lines:            5298130
Sum of lines:           1467396988
Ari. Mean:              276.965077867097
Geo. Mean:              168.518495379209
Median:                 188
Mode:                   21 (N=44789)
Anti-Mode:              3849 (N=1)
Minimum:                6
Maximum:                160602
Variance:               216904.549201035
StdDev:                 465.730124858845

In the above, I concatenated the BED file and piped it to Perl, which subtracted the start coordinate from the end to get the length of each element, and then piped the lengths to the stats program, which calculated some simple statistics. The total line count corresponds to the number of repetitive elements, which together span 1,467,396,988 bp of the hg19 genome. That's around half of the hg19 genome. Next, I will convert this BED file into a FASTA file and randomly sample 5 million reads from the repeats.
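As a quick sanity check of the "around half" figure (assuming the commonly quoted hg19 golden-path total of 3,137,161,264 bp, which includes gaps):

```r
# total hg19 length (assumption: sum of chromosome lengths, incl. gaps)
hg19_bp   <- 3137161264
# repeat bases from the stats output above
repeat_bp <- 1467396988

# fraction of the genome covered by RepeatMasker annotations
repeat_bp / hg19_bp
# roughly 0.47, i.e. around half the genome
```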

Continue reading

Using the Bioconductor annotation packages

Another post related to this course I'm going through (I can't link it enough times). I have almost finished the first day of the course and couldn't resist writing about this lecture on using the Bioconductor annotation packages. I had not realised (pardon my ignorance) that the annotation packages could be queried in much the same manner as SQL statements. See the lecture for more detail.
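As a small taste of the SQL-like interface, here is a sketch using the org.Hs.eg.db annotation package (assuming it has been installed from Bioconductor; the gene symbols are just examples):

```r
# a minimal sketch, assuming org.Hs.eg.db is installed from Bioconductor
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  suppressMessages(library(org.Hs.eg.db))
  # select() behaves like an SQL query:
  # columns ~ SELECT, keys ~ WHERE, keytype names the key column
  select(org.Hs.eg.db,
         keys    = c("TP53", "BRCA1"),
         columns = c("ENTREZID", "ENSEMBL"),
         keytype = "SYMBOL")
}
```

You can list the available columns and key types for a package with columns() and keytypes().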

Continue reading

Using aggregate and apply in R

2016 October 13th: I wrote a post on using dplyr to perform the same aggregating functions as in this post; personally I prefer dplyr.

I recently came across a course on data analysis and visualisation and am gradually going through each lecture. I just finished the second lecture, and the section "Working with dataframes and vectors efficiently" introduced me to the aggregate function, which I can see being extremely useful. In this post, I will write about aggregate, apply, lapply and sapply, which were also introduced in the lecture.

Let's get started with the ChickWeight dataset (see ?ChickWeight) available in the R datasets:

#load data
data <- ChickWeight
head(data)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

#dimension of the data
dim(data)
[1] 578   4

#how many chickens
unique(data$Chick)
 [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48

#how many diets
unique(data$Diet)
[1] 1 2 3 4
Levels: 1 2 3 4

#how many time points
unique(data$Time)
 [1]  0  2  4  6  8 10 12 14 16 18 20 21

#plot weight over time, one line per chick
library(ggplot2)
ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +
       geom_line()
Over time the chickens got heavier.
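To illustrate aggregate and sapply on the same dataset, here is a short sketch summarising mean weight by group:

```r
data <- ChickWeight

# aggregate() with the formula interface: mean weight for each
# Diet and Time combination
agg <- aggregate(weight ~ Diet + Time, data = data, FUN = mean)
head(agg)

# a similar per-group summary with split() and sapply():
# mean weight for each diet across all time points
sapply(split(data$weight, data$Diet), mean)
```

The formula interface reads naturally as "weight, grouped by Diet and Time", and sapply() simplifies the per-diet list of means into a named vector.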

Continue reading

Singular Value Decomposition using R

In linear algebra terms, a Singular Value Decomposition (SVD) is the decomposition of a matrix X into three matrices, each having special properties. If X is a matrix with each variable in a column and each observation in a row, then the SVD is

X = U D V^T

where the columns of U are orthogonal (left singular vectors), the columns of V are orthogonal (right singular vectors) and D is a diagonal matrix (singular values). Here I perform an SVD on the iris dataset in R.
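A quick sketch of the decomposition on the four numeric columns of iris (centring and scaling the matrix first is a choice, not part of the SVD itself):

```r
# SVD of the numeric columns of iris, centred and scaled
x <- scale(as.matrix(iris[, 1:4]))
s <- svd(x)

# the singular values are returned in decreasing order
s$d

# sanity check: the reconstruction X = U D V^T recovers the original matrix
max(abs(x - s$u %*% diag(s$d) %*% t(s$v)))
```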

Continue reading

Fitting a Michaelis-Menten curve using R

Many biological phenomena follow one of four types of relationship: sigmoid, exponential, linear or Michaelis-Menten. This is just a short post on fitting data to a Michaelis-Menten curve using the drc package for R, available on CRAN.

#install if necessary
install.packages("drc")
library(drc)

x <- seq(from=0,to=10e6,length=11)
 [1] 0e+00 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 7e+06 8e+06 9e+06 1e+07
y <- c(0, 61200.5, 102707.0, 138467.5, 171614.0, 202326.5, 230900.5, 258481.5, 284502.6, 309394.2, 333246.3)
df <- data.frame(x,y)
       x        y
1  0e+00      0.0
2  1e+06  61200.5
3  2e+06 102707.0
4  3e+06 138467.5
5  4e+06 171614.0
6  5e+06 202326.5
7  6e+06 230900.5
8  7e+06 258481.5
9  8e+06 284502.6
10 9e+06 309394.2
11 1e+07 333246.3

m1 <- drm(y ~ x, data = df, fct = MM.2())
summary(m1)

Model fitted: Michaelis-Menten (2 parms)

Parameter estimates:

                Estimate Std. Error    t-value p-value
d:(Intercept) 8.1177e+05 5.0017e+04 1.6230e+01       0
e:(Intercept) 1.4721e+07 1.3435e+06 1.0957e+01       0

Residual standard error:

 4643.671 (9 degrees of freedom)

plot(m1,log='',xlim=c(0,5e8), ylim=c(0,1e6), xlab="Reads", ylab="Transcripts")
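The same curve can also be fitted with base R's nls() as a cross-check; the model is y = d * x / (e + x), and the starting values below are rough guesses near the drc estimates:

```r
# the same data as above
x <- seq(from = 0, to = 10e6, length = 11)
y <- c(0, 61200.5, 102707.0, 138467.5, 171614.0, 202326.5,
       230900.5, 258481.5, 284502.6, 309394.2, 333246.3)

# Michaelis-Menten fit via nonlinear least squares;
# d is the asymptote, e the half-saturation constant
fit <- nls(y ~ d * x / (e + x), start = list(d = 8e5, e = 1.5e7))
coef(fit)
```

Both fits minimise the same least-squares criterion, so the parameter estimates should agree closely with the drc output above.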

Continue reading