Transcriptome feed using R

I’ve always wanted to create a transcriptome feed on Twitter, posting the results of daily PubMed searches. Well, today I finally got around to it. Firstly, I made a new Twitter account; annoyingly, all the Twitter handles I wanted were taken by inactive users, so I went with @transcriptomes. Next, I made a new Twitter application associated with the new account (I set its permissions to “Read, Write and Access direct messages”) and set things up so that I could use the twitteR package to communicate with this app. For this post, I’m using OS X 10.10.1 on a MacBook Air.

#install the package
install.packages("twitteR")
#load the package
library("twitteR")

#to get your consumerKey and consumerSecret see the twitteR documentation for instructions
consumer_key <- 'secret'
consumer_secret <- 'secret'
access_token <- 'secret'
access_secret <- 'secret'
setup_twitter_oauth(consumer_key,
                    consumer_secret,
                    access_token,
                    access_secret)

#send first tweet
updateStatus("It's alive!")

Things are in order.

PubMed searches

I used the RISmed package to perform PubMed queries.

#install package
install.packages("RISmed")
#load the package
library(RISmed)

I created a simple search, which looks for articles with the keyword “transcriptome” that have been deposited in the repository since yesterday.

#Get summary information on the results of a query
#the reldate parameter limits results
#to articles deposited since one day ago
summary <- EUtilsSummary('transcriptome', type='esearch', db='pubmed', reldate=1)

#download results of a query
result <- EUtilsGet(summary)

#hard limit of 50
my_limit <- 50
if(QueryCount(summary) <= my_limit){
  my_limit <- QueryCount(summary)
}

#loop through the results
#seq_len() handles the case of zero results (1:0 would wrongly loop twice)
for (i in seq_len(my_limit)){
  #PubMed ID
  my_id <- QueryId(summary)[i]
  #title of paper
  my_title <- ArticleTitle(result)[i]
  #tweets have a 140 char limitation
  if(nchar(my_title) > 93){
    my_title <- substr(my_title, start=1, stop=93)
    my_title <- paste(my_title, '...', sep='')
  }
  #create URL that links to the paper
  my_url <- paste('http://www.ncbi.nlm.nih.gov/pubmed/', my_id, sep='')
  #create my tweet
  my_tweet <- paste(my_title, my_url)
  #sleep
  Sys.sleep(2)
  #tweet the paper!
  updateStatus(my_tweet)
}

Setting up cron

I want to perform this search automatically each day. Below is the cron job I set up; it runs the feed.R script every hour. I set it up this way because I don’t leave my laptop on all the time, but I almost always have it on for over an hour at a stretch, so the script should get run at least once a day.

Update: I changed the cron job to run hourly after 15:00 (GMT+9); this makes it so that the current day on the NCBI server matches my current day.

crontab -l
#minute hour dom month dow cmd
0 15-23 * * * cd /Users/davetang/Dropbox/transcriptomes && ./feed.R &> /dev/null

I don’t want to tweet results that I’ve already tweeted about. To prevent that, the feed.R script simply looks for a file named according to the date (YYYYMMDD); if the file exists, the script quits. The contents of this file are the results of the PubMed search, which I wanted to save anyway.

cat feed.R 
#!/usr/bin/env Rscript

library("twitteR")
library(RISmed)

load("twitter_authentication.Rdata")
registerTwitterOAuth(cred)

today <- Sys.Date()
today <- format(today, format="%Y%m%d")

if(file.exists(today)){
  quit()
}

summary <- EUtilsSummary('transcriptome', type='esearch', db='pubmed', reldate=1)

result <- EUtilsGet(summary)

my_limit <- 50
if(QueryCount(summary) <= my_limit){
  my_limit <- QueryCount(summary)
}

for (i in seq_len(my_limit)){
  my_id <- QueryId(summary)[i]
  my_title <- ArticleTitle(result)[i]
  if(nchar(my_title) > 93){
    my_title <- substr(my_title, start=1, stop=93)
    my_title <- paste(my_title, '...', sep='')
  }
  my_url <- paste('http://www.ncbi.nlm.nih.gov/pubmed/', my_id, sep='')
  my_tweet <- paste(my_title, my_url)
  #delay the tweeting by 3 seconds
  Sys.sleep(3)
  updateStatus(my_tweet)
}

#save today's summary
save(summary, file = today)

quit()
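The twitter_authentication.Rdata file that feed.R loads isn’t shown above; it holds a saved OAuth credential. With the ROAuth workflow used by older versions of twitteR, it could have been created once, interactively, with something like the sketch below (the endpoint URLs are Twitter’s standard OAuth URLs; the rest is my guess at the one-time setup, not the exact code I ran):

```r
library(ROAuth)

#build an OAuth credential from the app's consumer key and secret
cred <- OAuthFactory$new(consumerKey    = 'secret',
                         consumerSecret = 'secret',
                         requestURL = 'https://api.twitter.com/oauth/request_token',
                         accessURL  = 'https://api.twitter.com/oauth/access_token',
                         authURL    = 'https://api.twitter.com/oauth/authorize')

#interactive, one-time step: prints a URL to visit and asks for the PIN
cred$handshake()

#save the authorised credential for feed.R to load
save(cred, file = 'twitter_authentication.Rdata')
```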

And that’s it! I’ll keep an eye on @transcriptomes to see if any problems come up.





This work is licensed under a Creative Commons Attribution 4.0 International License.
3 comments
  1. Can’t believe I missed this (though probably as you didn’t tweet a link to this post!), I ended up doing a convoluted version of what you have for Google Scholar Alerts just this week. I thought I’d spotted a mistake in your calculation of link length, but I’m kinda horrified to see that dlvr.it truncates paper titles excessively in all the existing Twitter bots, making titles 20-odd characters less readable than need be :-/

    I feel like link shorteners are bad practice for science online anyway, since it makes any data mining exercise that bit more difficult.

    The character limit on Twitter’s links is deceptive because t.co shortening is applied – at present, https protocol links “count for” 23 characters, http 22, so my equivalent of your if (nchar(my_title) > 93) is to pass the URL into a function InCharLimit which returns the title’s character limit:

    InCharLimit <- function(tweet.url.string = '') {
      # Cautious: assume link will be longest possible (https, 23 characters)…
      url.char.count <- https.chars <- 23L
      http.chars <- 22L

      # …unless it is proven otherwise
      if (confirmed.http <- grepl('http://', tweet.url.string))
        url.char.count <- http.chars

      return(title.char.limit <- 140L - url.char.count - 1L)
    }
    # when calling AbbrevTitle on the title such that the URL (hence char. lim.) is taken into account

    AbbrevTitle <- function(start.str, known.url = NULL, use.abbreviations = T, max.compact = T, above.env = parent.frame()) {

      if (!is.null(known.url)) char.limit <- InCharLimit(known.url) else char.limit <- 116L

      working.title <- start.str
      while (nchar(working.title) > char.limit) {

        # abbreviation algorithm attempts to get below character limit…

      }
    }

    Also I don’t think sleep(3) is necessary, rate limit windows are over 15 minute intervals and for GET not POST [e.g. update status] requests. Updating a status isn’t API limited, just limited as a normal account would be, to 2,400 tweets per day, “broken down into semi-hourly limits” (not necessarily 50 per half hour), so sleeping for 3 seconds wouldn’t make a difference – 30s, 5m, 30m, … recursive sleep seems to be the company recommendation on forums.
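    A minimal sketch of the escalating back-off Louis describes (30 s, 5 min, 30 min, …); the schedule values and function names here are mine, not Twitter’s official numbers:

```r
# return the delay (in seconds) for a given retry attempt,
# capped at the longest delay in the schedule
backoff_delay <- function(attempt, schedule = c(30, 300, 1800)) {
  schedule[min(attempt, length(schedule))]
}

# retry updateStatus() with increasing pauses between failed attempts
retry_tweet <- function(tweet, max_attempts = 3) {
  for (attempt in seq_len(max_attempts)) {
    ok <- tryCatch({updateStatus(tweet); TRUE}, error = function(e) FALSE)
    if (ok) return(TRUE)
    Sys.sleep(backoff_delay(attempt))
  }
  FALSE
}
```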

    Thanks for sharing that cron script, that’s one of my next things to organise. Feel free to check out my version on GitHub 🙂

    I’ll probably switch from JSON to RData storage of ‘seen’ message IDs in my code too, first I’ve seen of the format here.
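    Storing seen IDs as RData only takes a few lines; a sketch (the file name, variable names, and example IDs below are mine):

```r
seen_file <- file.path(tempdir(), "seen_ids.Rdata")

# load previously tweeted IDs, or start with none
seen_ids <- if (file.exists(seen_file)) get(load(seen_file)) else character(0)

# IDs returned by the current search (illustrative values)
new_ids <- c("25000001", "25000002")

# only tweet IDs we have not seen before
unseen <- setdiff(new_ids, seen_ids)

# record everything so the next run skips these
seen_ids <- union(seen_ids, new_ids)
save(seen_ids, file = seen_file)
```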

    1. Hi Louis,

      thanks (again) for the detailed comment! You are right about the length condition; I did realise that twitter uses t.co shortening, so I could have maximised the title. But I just went with the easiest (laziest) approach.

      As you mentioned in your tweet, I would also prefer a non-cron approach because there are days when I don’t have Internet connection. I’ll look into AWS Lambda.

      Cheers,

      Dave
