Querying PubMed using R

I've seen talks over the years where the speaker shows a bar chart of the number of PubMed articles that contain a certain keyword, tallied per year. In most cases the speaker was trying to illustrate the growing number of articles that contain the keyword. Here I try to do the same by querying PubMed using R.

#install the RISmed package
install.packages("RISmed")
library(RISmed)

#now let's look up this dude called Dave Tang
res <- EUtilsSummary('dave tang', type='esearch', db='pubmed')

summary(res)
Query:
Tang, Dave[Full Author Name]

Result count:  10

#what are the PubMed ids for the Author Dave Tang?
QueryId(res)
[1] "23180801" "22976001" "22722852" "21888672" "21386911" "20510229" "19648138" "19501082" "19393063"
[10] "19270757"

#limit by date
res2 <- EUtilsSummary('dave tang', type='esearch', db='pubmed', mindate='2012', maxdate='2012')

summary(res2)
Query:
Tang, Dave[Full Author Name] AND 2012[EDAT] : 2012[EDAT]

Result count:  3

#three publications in 2012
QueryId(res2)
[1] "23180801" "22976001" "22722852"
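
To get the per-year tally behind the bar chart mentioned at the start, the same query can be repeated once per year. This is only a rough sketch: `pubmed_count` and `tally_by_year` are helper names I made up, the one-second pause is my own assumption about being polite to NCBI, and `fake_count` is a stand-in that returns made-up numbers so the tallying logic can be tried without a network connection.

```r
# Count PubMed hits for one term in one year.
# Needs the RISmed package and internet access, but only when it is called.
pubmed_count <- function(term, year) {
  res <- RISmed::EUtilsSummary(term, type = "esearch", db = "pubmed",
                               mindate = year, maxdate = year)
  Sys.sleep(1)  # assumed pause between requests
  RISmed::QueryCount(res)
}

# Tally counts over several years; count_fun can be swapped for a stand-in.
tally_by_year <- function(term, years, count_fun = pubmed_count) {
  counts <- vapply(years,
                   function(y) as.numeric(count_fun(term, y)),
                   numeric(1))
  names(counts) <- as.character(years)
  counts
}

# Offline demonstration with a made-up counting function (NOT real PubMed data)
fake_count <- function(term, year) year - 2000
demo <- tally_by_year("dave tang", 2010:2012, count_fun = fake_count)
print(demo)

# With RISmed installed and a network connection, something like:
# counts <- tally_by_year("dave tang", 2005:2012)
# barplot(counts, las = 2, xlab = "Year", ylab = "Number of articles")
```

Passing the counting function as an argument keeps the looping logic separate from the web query, so it can be checked before hitting PubMed.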


Probability

The fundamental idea of inferential statistics is determining the probability of obtaining the observed data when we assume the null hypothesis is true. For example, if we roll a die 10 times and get 10 sixes, what is the probability of observing this result if we assume the null hypothesis that the die is fair? If the die is fair, the probability of getting 10 sixes in 10 rolls is $\left(\frac{1}{6}\right)^{10} \approx 1.653817 \times 10^{-8}$, which is a very low probability. This probability is the p-value (more generally, the probability of a result at least as extreme as the one observed). Since it's extremely unlikely that we would observe 10 sixes in 10 rolls of a fair die by chance, we should reject the null hypothesis.
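
The die calculation can be checked in R. This is a small sketch; the variable names are mine, and the `dbinom` line simply treats the number of sixes as a binomial count with 10 trials and success probability 1/6.

```r
# Probability of 10 sixes in 10 rolls of a fair die, computed directly
p_direct <- (1/6)^10

# The same probability from the binomial distribution:
# 10 successes in 10 trials, each with probability 1/6
p_binom <- dbinom(10, size = 10, prob = 1/6)

print(p_direct)
print(p_binom)
```

Both lines should agree with the value quoted above.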

Let's consider a less extreme case than the previous example. Here I will use the example from the first lecture of the Statistics for Neuroscience (9506) course, in which a person (the lecturer) claimed that he had the ability to distinguish two different brands of espresso. Our null hypothesis in this case is that the lecturer doesn't have the ability to distinguish the brands and is simply guessing. We come up with an experiment to test his ability by giving him 8 cups of espresso, where 4 are from brand A and the other 4 are from brand B, and asking him to separate them into two groups. If he manages to correctly group the 8 cups into their respective brands, what is the probability of getting this result if we assume the null hypothesis is true, i.e. what's the probability of getting this result just by chance?
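
One way to work this out in R, as a sketch under the assumption that he knows there are 4 cups of each brand and so always puts 4 cups in each group: under the null hypothesis every way of choosing which 4 of the 8 cups to call brand A is equally likely, and only one of those choices is completely correct. The variable names are mine.

```r
# Number of ways to pick 4 cups out of 8 to label as brand A
n_ways <- choose(8, 4)

# Only one of these groupings is entirely correct
p_all_correct <- 1 / n_ways

# Same probability from the hypergeometric distribution:
# drawing 4 cups and having all 4 come from the 4 brand A cups
p_hyper <- dhyper(4, m = 4, n = 4, k = 4)

print(n_ways)
print(p_all_correct)
```

So a perfect grouping by pure guessing has a probability of 1 in 70, which is about 0.014.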

Using Gviz

Updated: 2013 November 15th

A while ago I asked on Twitter what tools people use to visualise hundreds of BAM files. One of the suggestions was Gviz (thanks Sebastian!), so I had a quick look at the Bioconductor package, and the plots looked really great! Here I use Gviz to plot features along a reference sequence and to visualise BAM files.

From the vignette:

The Gviz package aims to provide a structured visualisation framework for plotting any type of data along genomic coordinates. The fundamental concept behind the Gviz package is similar to the approach taken by most genome browsers, in that individual types of genomic features or data are represented by separate tracks.

Transcription factor binding site prediction

Updated 2013 December 17th to include JASPAR

I have a simple task: given a short DNA sequence, I want to know if there are any potential transcription factor binding sites within it. I looked online and found a transcription factor binding site prediction tool called TFSEARCH. It's very straightforward; all you have to do is input a sequence, which may explain its popularity (based on the site's counter and Google's PageRank for the site).

TFSEARCH

So I decided to test the tool out by inputting a sequence that matches maximally for the Hunchback transcription factor:

GCATAAAAAA

This is the main output of TFSEARCH after some formatting:

Position weight matrix

The process of transcription is influenced by proteins called transcription factors (TFs), which bind to specific sites called transcription factor binding sites (TFBSs) that are proximal or distal to a transcription start site. TFs generally have distinct binding preferences towards specific TFBSs; however, they can tolerate variations in the target TFBS. Thus, to model a TFBS, the nucleotides at each position are weighted according to the tolerance of the TF. A commonly used representation of such motifs (in our case TFBSs) in biological sequences is the position weight matrix (PWM), also called a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM).

How do we find TFBSs? DNA sequences that interact with TFs can be experimentally determined from SELEX experiments. Since this process involves the synthesis of a large number of randomly generated oligonucleotides, it reveals not only which DNA sequences interact with a TF but also how much variation is tolerated at specific sites. From SELEX experiments, a position frequency matrix (PFM) can be constructed by recording the position-dependent frequency of each nucleotide in the DNA sequences that interacted with the TF. Here's an example of a PFM as shown in this review "Applied bioinformatics for the identification of regulatory elements" (sorry paywall!):
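
Separately from the review's example, the counting step itself can be sketched in base R. The sequences below are made up for illustration (variants of the GCATAAAAAA sequence used earlier); they do not come from the review or from any SELEX experiment, and `build_pfm` is a hypothetical helper name.

```r
# Made-up, equal-length binding sequences (for illustration only)
sites <- c("GCATAAAAAA",
           "GCATAAAAAT",
           "ACATAAAAAA",
           "GCTTAAAAAA",
           "GCATAAACAA")

# Build a position frequency matrix: rows are nucleotides,
# columns are positions, entries are counts at that position.
build_pfm <- function(seqs) {
  stopifnot(length(unique(nchar(seqs))) == 1)  # sequences must be aligned
  bases <- c("A", "C", "G", "T")
  # one row per sequence, one column per position
  chars <- do.call(rbind, strsplit(seqs, ""))
  pfm <- apply(chars, 2,
               function(column) table(factor(column, levels = bases)))
  dimnames(pfm) <- list(bases, as.character(seq_len(ncol(chars))))
  pfm
}

pfm <- build_pfm(sites)
print(pfm)

# Dividing each column by its total gives per-position frequencies
ppm <- sweep(pfm, 2, colSums(pfm), "/")
```

Each column of `pfm` sums to the number of sequences, and positions where the made-up sequences disagree (for example the first position, with four G and one A) are where the toy motif tolerates variation.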