Thoughts on converting gene identifiers

If you've worked in the genomics field, then you've most likely spent some time converting gene identifiers to other gene identifiers or to gene symbols. Some common gene annotation databases include RefSeq, UCSC known genes, Entrez gene and Ensembl genes. In the past, I've relied on large tables that provide the lookup between different identifiers, such as the gene2refseq.gz file that provides lookups between Entrez gene IDs to RefSeq IDs.

However, more elegant and robust solutions exist such as BioMart, which is a freely available online service that allows you to download genomic annotations from several databases. The usefulness of the service is that you can easily perform lookups between several different databases, i.e. converting gene identifiers. There's a Perl package for working with BioMart and a R Bioconductor package. For this post, I will be using the R Bioconductor biomaRt package; I've previously written a short guide on learning to use biomaRt for those interested in a short introduction. The crux of biomaRt is summarised in the biomaRt vignette:

Continue reading

Sorting a huge BED file

I asked a question on Twitter about sorting a really huge file (more specifically sorting a huge BED file). To put really huge into context, the file I'm processing has 3,947,386,561 lines of genomic coordinates. I want the file to be sorted by the chromosome (lexical order), then by the start coordinate (numeric order) and then by the strand (by + and -), i.e. sort file.bed -k1,1 -k2,2n -k6,6.

To get started with a smaller file, download a CAGE dataset from ENCODE:

#download a CAGE bam file from ENCODE
wget -O test.bam
md5sum test.bam
895bcf6430707991d108281ddf1e8977  test.bam
samtools flagstat test.bam
23166381 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
23166381 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

#convert to bed using BEDTools available at
bamToBed -i test.bam > test.bed
cat test.bed | wc -l
head -3 test.bed
chr1    10571   10598   HWUSI-EAS733_0011:1:62:3915:15245#0|AGA 1       -
chr1    10571   10598   HWUSI-EAS747_0013:3:40:11187:7108#0|AGA 1       -
chr1    10571   10598   HWUSI-EAS747_0013:3:60:14237:7545#0|AGA 1       -

Continue reading

Using GNU parallel

Updated 2015 August 3rd to include a section on formatting the output, which allows you remove two levels of file extensions.

I wrote this short guide on using GNU parallel for my biologist buddies who would like to harness the power of parallelisation. There are a lot of really useful guides out there but here I try to give simplistic examples. Let's get started by downloading the program and compiling it:

tar -xjf parallel-latest.tar.bz2
cd parallel-20131022/
#parallel will be compiled in the current directory
./configure --prefix=`pwd` && make && make install

Now adapting some of the examples from the GNU parallel tutorial:

#a basic example of running 3 echos in parallel
parallel echo ::: A B C

Continue reading

Analysing miRNA expression in cancers

MiRNAs are a class of small RNAs that when expressed usually down regulates the expression of its target transcript by binding to it and causing it to degrade or inhibiting it from being translated. There has been a lot of interest in studying the expression pattern of miRNAs, especially in relation to cancer, since their mis-regulation and aberrant expression, may cause aberrant expression of other transcripts. If for example, a miRNA regulating a oncogene (a gene that has the potential to promoter cancer when over expressed) is under expressed, then the oncogene may become over active and switch on cellular processes that promoter cancer. On the other hand, if a miRNA is aberrantly over expressed and its target is a gene that protects us from the over proliferation of cells, i.e. a tumour suppressor gene, then we lose this protective ability, from which a cancer may form.

There are thousands of known miRNAs and the aberrant expression of a specific miRNA may have different consequences and in the context of cancers, it may cause cancer in a different tissue. Thus to investigate which miRNAs are mis-expressed in various cancers, I downloaded 5 miRNA datasets from The Cancer Genome Atlas, corresponding to Acute Myeloid Leukemia (188 samples), Bladder Urothelial Carcinoma (208 samples), Brain Lower Grade Glioma (299 samples), Liver hepatocellular carcinoma (186 samples) and Lung adenocarcinoma (482 samples).

Continue reading