# Using the GenometriCorr package

I was reading through the bedtools jaccard documentation when I saw the reference "Exploring Massive, Genome Scale Datasets with the GenometriCorr Package". Firstly for those wondering what the Jaccard index is, it's a simple metric that is defined as so:

The numerator is the number of intersections between A and B, and the denominator is the union of A and B. If you wanted a metric to summarise the amount of intersection between two sets of features, check out bedtools jaccard. Back to the topic, I decided to check out the paper, as the title attracted me and I wanted to learn more about the GenometriCorr package. It turns out that it's an R package.

# Finding sequence conservation

I have written about sequence conservation in vertebrates previously but without much elaboration, hence I'm writing another post on this topic. An assumption of sequence conservation is that regions that show conservation, are under purifying selection, i.e. alleles that decrease the fitness of an organism are removed, and therefore probably do something important. Protein-coding regions are typically well conserved among the genomes of different species, so it's widely accepted that they are useful. Sequences need to be aligned together in order to infer sequence conservation and conveniently, a multiple sequence alignment (MSA) of 46 vertebrate genomes is provided at the UCSC Genome Browser site.

# Getting started with Picard

Picard is a suite of Java-based command-line utilities that manipulate SAM/BAM files. Currently, I'm analysing some paired-end libraries and I wanted to calculate the average insert size based on the alignments; that's how I found Picard. While reading the documentation I realised that Picard has a lot more to offer and hence I'm writing this post to look into specific tools that I thought would be useful.

To get started, download Picard and an alignment (whole genome shotgun sequencing) from the 1000 genomes project that we can use to test Picard:

# Repetitive elements in vertebrate genomes

Updated 2015 February 8th to include some scatter plots of genome size versus repeat content.

I was writing about the make up of genomes today and was looking up statistics on repetitive elements in vertebrate genomes. While I could find individual papers with repetitive element statistics for a particular genome, I was unable to find a summary for a list of vertebrate genomes (but to be honest I didn't look very hard). So I thought I'll make my own and share it on my blog and via figshare. I will use the RepeatMasker annotations provided via the UCSC genome browser.

# Genomic Regions Enrichment of Annotations Tool

The Genomic Regions Enrichment of Annotations Tool (GREAT) is a tool that allows you to find enriched ontological terms in a set of genomic regions. This talk (running time ~1 hour) gives an overview of the tool. In brief, GREAT is an alternative to gene-centric enrichment tools such as DAVID and uses a binomial test to test for ontology enrichment. Figure 1b in the GREAT paper explains how GREAT models functional annotations in the genome. The advantage of using a binomial model, is that it takes into account the probability of having a genomic region overlap a region associated with a particular ontology, so that ontologies that are biased in terms of genome coverage are taken into account. GREAT incorporates annotations from 20 ontologies and is available as a web application. As stated in the paper, the utility of GREAT is not limited to just ChIP-seq data and for those who are more interested, check out their paper.

# How mappable is a specific repeat?

If you've ever wondered how mappable a specific repeat is, here's a quick post on creating a plot showing the mappability of a repetitive element along its consensus sequence. Specifically, the consensus sequence of a repeat was taken and sub-sequences were created by a sliding window approach (i.e. moving along the sequence) at 1 bp intervals and these sub-sequences were mapped to hg19.

I will use bwa for the mapping, so firstly download and compile bwa:

wget http://sourceforge.net/projects/bio-bwa/files/latest/bwa-0.7.7.tar.bz2
tar -xjf bwa-0.7.7.tar.bz2
cd bwa-0.7.7
make

#index hg19
bwa index hg19.fa


# Bioconductor annotation packages

The Bioconductor annotation packages are an extensive collection of annotations. For this post I simply illustrate the basics of probing these annotation packages. For the first example I will use the org.Hs.eg.db package, which provides genome wide annotations for the human genome.

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("org.Hs.eg.db")