Using the GenometriCorr package

I was reading through the bedtools jaccard documentation when I saw the reference “Exploring Massive, Genome Scale Datasets with the GenometriCorr Package”. Firstly, for those wondering what the Jaccard index is, it’s a simple metric defined as follows: $$J(A,B) = \frac{| A \cap B |}{| A \cup B |}$$ The numerator is the…
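As a rough illustration of the formula above, here is a minimal Python sketch that computes the Jaccard index for two sets of genomic intervals on a single chromosome, counting base pairs. The half-open `(start, end)` interval representation and the helper names (`merge`, `covered_bases`, `intersection_bases`, `jaccard`) are assumptions made for this example; it is not the bedtools or GenometriCorr implementation.

```python
def merge(intervals):
    """Merge overlapping or adjacent half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))


def intersection_bases(a, b):
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, measured in base pairs."""
    inter = intersection_bases(a, b)
    union = covered_bases(a) + covered_bases(b) - inter
    return inter / union if union else 0.0


# Toy intervals (hypothetical data, single chromosome).
a = [(10, 20), (30, 40)]
b = [(15, 35)]
print(jaccard(a, b))
```

In this toy case the two sets share 10 bases and jointly cover 30, so the index is one third.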


Finding sequence conservation

I have written about sequence conservation in vertebrates previously but without much elaboration, hence I’m writing another post on this topic. An assumption behind sequence conservation is that conserved regions are under purifying selection (i.e. alleles that decrease the fitness of an organism are removed) and therefore probably do something important. Protein-coding regions…


Getting started with Picard

Updated hyperlinks on 2015 January 26th; please comment if you find any more dead links. Picard is a suite of Java-based command-line utilities that manipulate SAM/BAM files. Currently, I’m analysing some paired-end libraries and I wanted to calculate the average insert size based on the alignments; that’s how I found Picard. While reading the…


Repetitive elements in vertebrate genomes

Updated 2015 February 8th to include some scatter plots of genome size versus repeat content. I was writing about the makeup of genomes today and was looking up statistics on repetitive elements in vertebrate genomes. While I could find individual papers with repetitive element statistics for a particular genome, I was unable to find…


Genomic Regions Enrichment of Annotations Tool

The Genomic Regions Enrichment of Annotations Tool (GREAT) is a tool that allows you to find enriched ontological terms in a set of genomic regions. This talk (running time ~1 hour) gives an overview of the tool. In brief, GREAT is an alternative to gene-centric enrichment tools such as DAVID and uses a binomial test…


How mappable is a specific repeat?

If you’ve ever wondered how mappable a specific repeat is, here’s a quick post on creating a plot showing the mappability of a repetitive element along its consensus sequence. Specifically, the consensus sequence of a repeat was taken and sub-sequences were created by a sliding window approach (i.e. moving along the sequence) at 1 bp…
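The sliding window step described above can be sketched in a few lines of Python. The window size, the toy consensus sequence, and the function name `sliding_windows` are assumptions for illustration only, since the excerpt gives only the 1 bp step.

```python
def sliding_windows(sequence, window_size, step=1):
    """Yield (start, sub-sequence) pairs, moving along the sequence by `step` bp."""
    for start in range(0, len(sequence) - window_size + 1, step):
        yield start, sequence[start:start + window_size]


# Toy consensus sequence and window size (both hypothetical).
consensus = "ACGTACGTGG"
for start, sub in sliding_windows(consensus, window_size=4):
    print(start, sub)
```

Each sub-sequence keeps its start position, so a per-window result can later be plotted along the consensus sequence.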


Bioconductor annotation packages

The Bioconductor annotation packages are an extensive collection of annotations. In this post I simply illustrate the basics of probing these annotation packages. For the first example I will use the org.Hs.eg.db package, which provides genome-wide annotations for the human genome. We can query the package by using the select() function; to find out…


Mapping repeats 2

Updated 10th September 2013 to include LAST. I previously looked at mapping repeats with respect to sequencing errors in high throughput sequencing and, as one would expect, the accuracy of the mapping decreased when sequencing errors were introduced. I then looked at aligning to unique regions of the genome to get an idea of how…


Aligning to unique regions

Post updated on 10th September 2013 after receiving input from the author of LAST. I’ve been interested in aligning reads to the repetitive portion of the human genome; in this post I’ll look into how well different short read alignment programs perform when aligning to unique regions of the genome. Firstly to find unique…


ENCODE mappability and repeats

The ENCODE mappability tracks can be visualised on the UCSC Genome Browser and they provide a sense of how mappable a region of the genome is in terms of short reads or k-mers. On a side note, it seems some people use “mapability” and some use “mappability”; I was taught the CVC rule, so I’m…
