Managing papers

Update: 2015 August 19th: I discovered the "Retrieve Metadata from PDF" feature in Zotero that can retrieve all the metadata from a PDF and adds all the relevant information. (This doesn't work for all PDFs but for the majority it works. Note that if you perform too many queries, Google Scholar will restrict you from performing further searches.)

Since last year, I have been managing papers using Zotero following a recommendation from Casey (thanks again!):

Calculating the h-index

Updated 2014 September 19th to include a method that does not require sorting.

The h-index is an index that is calculated by measuring the number of articles, say $n$, that has at least $n$ citations. If you published 10 articles, and each of them had four citations, your h-index would be four, since there are four articles with at least four citations. I'm not interested in what this number represents or measures but instead I'm interested in how one would calculate the h-index given an array of numbers.

Reading the early release of "Bioinformatics Data Skills"

2014 August 11th: I noticed that some people are coming across this post by googling "bioinformatics data skills pdf". IMHO, this book is well worth the 40 dollars (price of the ebook) and I'm sure the author went through a lot of hard work to get it published, so please compensate him for his work.

This is going to be the first ever book commentary I've written that isn't for a school assignment (and also the first on this blog). I felt compelled to write it because I could relate to many of the issues highlighted in the early release version of the "Bioinformatics Data Skills" book and I wanted to share my thoughts. Just like in school, I wasn't aptly suited to review (nor deeply appreciate) classics such as "Lord of the Flies" or "Brave New World" but I could offer my interpretation based on what I gathered. Similarly, I'm not a bioinformatics expert but I can provide my thoughts on the book, for what it's worth.

Understanding the BAM flags

I've tried to explain the BAM flags to my colleagues and I think each time I have left them more confused. So perhaps I can do a better job of explaining BAM flags in writing. For this post, I will use this BAM file from the 1000 Genomes Project:

A small list of command line tips

Updated: 2014 May 14th; added even more tips

I'm in the middle of writing papers and my thesis, so I've been quite busy. However, I wanted to write a quick blog post as an outlet. So here's a list of random command line tips off the top of my head (GNU bash, version 4.1.2(1)-release); I hope that there's at least one tip in this list you didn't know about beforehand.

The find tool is extremely useful; some uses include:

#in all Perl files
#execute a grep quietly
#and look for human_id
#then report which file contains a match
find . -name '*.pl' -exec grep -q 'human_id' {} \; -print

#find broken symbolic links and delete them
find -L . -type l -delete


Randomly shuffle lines using shuf:

for i in {1..10}; do echo \$i; done | shuf
9
10
5
2
8
3
4
6
1
7


Use the -A and -B parameters of grep to print the lines before and after your matched line.

#echo out a bunch of lines
echo -e "3\n2\n1\nA\n1\n2\n3"
3
2
1
A
1
2
3

#show me two lines before and two lines after A
echo -e "3\n2\n1\nA\n1\n2\n3" | grep -B2 -A2 A
2
1
A
1
2

#also use -E with grep for extended regular expressions
#Basic vs Extended Regular Expressions
#       In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the #backslashed versions \?, \+, \{, \|, $, and$.
#
#       Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead, so portable #scripts should avoid { in grep -E patterns and should use [{] to match a literal {.
#
#       GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start of an invalid #interval specification.  For example, the command grep -E '{1' searches for the two-
#       character string {1 instead of reporting a syntax error in the regular expression.  POSIX.2 allows this behavior as an #extension, but portable scripts should avoid it.


Use watch to run a command every 2 seconds (default is 2 seconds):

#monitor the contents of a folder
#this can be useful for monitoring output files
watch ls -lt

#monitor the statuses of UGE hosts
watch qhost

#other monitoring tips
#use "tail -f" to monitor a file if it is growing
tail -f /var/log/apache/access.log

#check out the last 10 users who have logged into a server


Sorting a huge BED file

I asked a question on Twitter about sorting a really huge file (more specifically sorting a huge BED file). To put really huge into context, the file I'm processing has 3,947,386,561 lines of genomic coordinates. I want the file to be sorted by the chromosome (lexical order), then by the start coordinate (numeric order) and then by the strand (by + and -), i.e. sort file.bed -k1,1 -k2,2n -k6,6.

To get started with a smaller file, download a CAGE dataset from ENCODE:

#download a CAGE bam file from ENCODE
md5sum test.bam
895bcf6430707991d108281ddf1e8977  test.bam
samtools flagstat test.bam
0 + 0 duplicates
23166381 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

#convert to bed using BEDTools available at https://code.google.com/p/bedtools/
bamToBed -i test.bam > test.bed
cat test.bed | wc -l
23166381
chr1    10571   10598   HWUSI-EAS733_0011:1:62:3915:15245#0|AGA 1       -
chr1    10571   10598   HWUSI-EAS747_0013:3:40:11187:7108#0|AGA 1       -
chr1    10571   10598   HWUSI-EAS747_0013:3:60:14237:7545#0|AGA 1       -


Using GNU parallel

Updated 2015 August 3rd to include a section on formatting the output, which allows you remove two levels of file extensions.

I wrote this short guide on using GNU parallel for my biologist buddies who would like to harness the power of parallelisation. There are a lot of really useful guides out there but here I try to give simplistic examples. Let's get started by downloading the program and compiling it:

wget http://ftp.jaist.ac.jp/pub/GNU/parallel/parallel-latest.tar.bz2
tar -xjf parallel-latest.tar.bz2
cd parallel-20131022/
#parallel will be compiled in the current directory
./configure --prefix=pwd && make && make install


Now adapting some of the examples from the GNU parallel tutorial:

#a basic example of running 3 echos in parallel
parallel echo ::: A B C
A
B
C