Thoughts on converting gene identifiers

If you’ve worked in genomics, you’ve most likely spent some time converting gene identifiers to other gene identifiers or to gene symbols. Common gene annotation databases include RefSeq, UCSC Known Genes, Entrez Gene and Ensembl genes. In the past, I’ve relied on large tables that provide the lookup between different identifiers,…

Sorting a huge BED file

I asked a question on Twitter about sorting a really huge file (more specifically, a huge BED file). To put “really huge” into context, the file I’m processing has 3,947,386,561 lines of genomic coordinates. I want the file to be sorted by chromosome (lexical order), then by start coordinate (numeric order) and…

Using GNU parallel

Updated 2020 February 26th to include the section “Strip directory and extensions”. I wrote this short guide on using GNU parallel for my biologist buddies who would like to harness the power of parallelisation. There are a lot of really useful guides out there, but here I try to give simple examples. Let’s get started by…

Analysing miRNA expression in cancers

miRNAs are a class of small RNAs that, when expressed, usually down-regulate the expression of their target transcripts by binding to them and either causing them to degrade or inhibiting their translation. There has been a lot of interest in studying the expression pattern of miRNAs, especially in relation to cancer, since their…
