I realise I have random posts all over the place, which are also not properly tagged. This page aims to group related posts together, along with a bit more elaboration. But first, what is high throughput bioinformatics? I needed a title for the page, and “bioinformatics dealing with high throughput sequencing data” was too long.
When working with high throughput sequencing data, you often have to align your reads to a reference genome. The following post touches on the topic of mapping qualities. Most of the time you will be dealing with SAM/BAM files, and a Perl module has been developed for this task. When you don’t have a reference genome to align to, an option is de novo assembly, and Velvet is a popular program for assembling short reads. I also describe one way of setting up your own workstation at home for performing such analyses.
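As a quick illustration of what a mapping quality encodes: the SAM format defines MAPQ as -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer, with 255 meaning the value is unavailable. The sketch below only converts between the two scales. The function names are my own, and real aligners estimate and cap MAPQ in different ways, so treat it as a reading aid rather than a description of any particular aligner.

```python
import math


def mapq_to_error_prob(mapq):
    """Convert a Phred-scaled mapping quality to P(mapping position is wrong)."""
    if mapq == 255:
        # 255 is reserved in the SAM format for "mapping quality not available"
        raise ValueError("MAPQ 255 means the mapping quality is unavailable")
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(p_wrong):
    """Convert P(mapping position is wrong) to a Phred-scaled integer."""
    if not 0 < p_wrong <= 1:
        raise ValueError("probability must be in (0, 1]")
    return round(-10 * math.log10(p_wrong))


if __name__ == "__main__":
    for q in (0, 10, 20, 30):
        print(q, mapq_to_error_prob(q))
```

So a read with MAPQ 20 has, nominally, a 1 in 100 chance of being placed at the wrong position, and MAPQ 30 corresponds to 1 in 1000.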
For analysing differential tag (i.e. read) expression there are several popular choices. I have used DESeq but use edgeR more often. Here I compare the two packages. I also experiment with pooling samples in edgeR. edgeR includes a normalisation method, which I experiment with. One of the most important concepts in calling differential tag expression is estimating the variance. In edgeR this is done via the common dispersion parameter.
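To make the role of the dispersion a little more concrete: under the negative binomial model that edgeR uses, a tag with mean count mu and dispersion phi has variance mu + phi * mu^2, so the squared coefficient of variation is 1/mu + phi. The sketch below just evaluates those two formulas. It does not estimate the dispersion the way edgeR does, and the function names and example numbers are made up for illustration.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count with the given mean and dispersion."""
    if mean < 0 or dispersion < 0:
        raise ValueError("mean and dispersion must be non-negative")
    # The Poisson part (mean) plus the extra between-replicate part
    return mean + dispersion * mean ** 2


def nb_squared_cv(mean, dispersion):
    """Squared coefficient of variation: technical part (1/mean) plus dispersion."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    return 1 / mean + dispersion


if __name__ == "__main__":
    # Hypothetical dispersion of 0.1 applied to tags of increasing abundance
    for mu in (10, 100, 1000):
        print(mu, nb_variance(mu, 0.1), nb_squared_cv(mu, 0.1))
```

With a dispersion of zero the variance equals the mean, which is the Poisson case. For highly expressed tags the 1/mu term vanishes and the dispersion dominates the variability.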
After calling a set of tags/genes as differentially expressed, the next task is often to find out whether these tags/genes share any common functionality. One way of doing so is by performing a gene ontology enrichment analysis. If you want to do the reverse, you can find gene names according to GO terms. Additionally, you may be interested in whether some of the genes in your gene list belong to the same pathway.
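A common way to test a single GO term for enrichment is a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test), which asks how surprising the overlap between your gene list and the term's genes is. The sketch below is a minimal version of that idea and not necessarily what any particular GO tool does. The function name, argument names and the numbers in the example are all hypothetical, and it needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def enrichment_p_value(universe, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `annotated` carry the GO term."""
    if not (0 <= annotated <= universe and 0 <= selected <= universe):
        raise ValueError("annotated and selected must lie between 0 and universe")
    overlap = max(overlap, 0)
    max_overlap = min(annotated, selected)
    # Count the ways of drawing k annotated and (selected - k) other genes
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / comb(universe, selected)


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 genes, 200 with the term,
    # 100 differentially expressed genes, 8 of which carry the term
    print(enrichment_p_value(20000, 200, 100, 8))
```

In practice you would run this for every term and then correct for multiple testing, since thousands of GO terms are tested at once.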
Some popular techniques for clustering or grouping datasets include PCA, hierarchical clustering, hierarchical clustering using your own defined distance matrix, hierarchical clustering with p values and representing data clusters with heatmaps.
For determining the similarity of expression patterns between libraries, a popular choice is Pearson’s correlation, and I adapted some Perl code for calculating it. Another popular choice is Spearman’s rank correlation coefficient, and here I show the difference between calculating correlation using a parametric and a non-parametric method. For finding coexpression patterns I experiment with an available R package for doing so.
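The difference between the two correlation measures is easy to see in a small sketch. Pearson's correlation works on the raw values, while Spearman's is simply Pearson's correlation computed on the ranks (with tied values sharing their average rank), so it is far less sensitive to a single extreme value. This is a plain Python illustration and not the Perl code mentioned above. The two "libraries" are made-up numbers.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Made-up tag counts; the last tag is an outlier in library A
    lib_a = [1, 2, 3, 4, 100]
    lib_b = [2, 4, 6, 8, 10]
    print("Pearson:", pearson(lib_a, lib_b))
    print("Spearman:", spearman(lib_a, lib_b))
```

Here the two libraries rank the tags identically, so Spearman's coefficient is 1, while the outlier pulls Pearson's coefficient well below 1.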
For other unrelated work, please feel free to poke around by clicking on tags, monthly archives, etc. You may also be interested in my bioinformatics wiki page.
Have fun,
Dave