After getting started and getting acquainted with DNA sequencing data, it's finally time to explore DNA variation. A tool that makes this easy is GEMINI, and this post briefly demonstrates some of its functionality. I have only used GEMINI sparingly, and what I know about the tool comes mostly from its documentation and tutorials. Be sure to check them out if you're planning on using GEMINI.
A post on linking OMIM IDs to gene coordinates using biomaRt; this provides a way of representing OMIM IDs on the genome. For those unfamiliar with OMIM, here's the description from the OMIM FAQ:
Online Mendelian Inheritance in Man (OMIM) is a continuously updated catalog of human genes and genetic disorders and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression.
Transposons have the ability to "jump" around in genomes and sometimes transposons jump into genomic sites occupied by other repetitive elements; these cases are what I refer to as "composite repetitive elements" for the purpose of this post. While almost all DNA transposons and the majority of retrotransposons have lost the ability to move around in the human genome, transposition events that have occurred in the past are captured within the genome sequence. This post is about finding composite repetitive elements in the human genome based on RepeatMasker annotations.
Here's a very short post on how to fetch lincRNAs from Ensembl using R and the biomaRt package. For those who are not familiar with biomaRt, you can check out my older post on biomaRt. Firstly, start R and install the biomaRt package from Bioconductor by copying and pasting the code below:
Updated 2015 February 8th to include some scatter plots of genome size versus repeat content.
I was writing about the makeup of genomes today and was looking up statistics on repetitive elements in vertebrate genomes. While I could find individual papers with repetitive element statistics for a particular genome, I was unable to find a summary for a list of vertebrate genomes (but to be honest I didn't look very hard). So I thought I'd make my own and share it on my blog and via figshare. I will use the RepeatMasker annotations provided via the UCSC genome browser.
The Genomic Regions Enrichment of Annotations Tool (GREAT) is a tool that allows you to find enriched ontological terms in a set of genomic regions. This talk (running time ~1 hour) gives an overview of the tool. In brief, GREAT is an alternative to gene-centric enrichment tools such as DAVID and uses a binomial test to test for ontology enrichment. Figure 1b in the GREAT paper explains how GREAT models functional annotations in the genome. The advantage of using a binomial model is that it takes into account the probability of a genomic region overlapping a region associated with a particular ontology term, so that terms that are biased in terms of genome coverage are accounted for. GREAT incorporates annotations from 20 ontologies and is available as a web application. As stated in the paper, the utility of GREAT is not limited to ChIP-seq data; if you are interested in the details, check out their paper.
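To make the binomial idea concrete, here is a minimal sketch in base R. All of the numbers are made up for illustration and the variable names are my own; this is not GREAT's implementation, just the kind of calculation a binomial enrichment test involves. The assumption is that the probability of a region hitting a term is the fraction of the genome covered by regions associated with that term.

```r
# Hypothetical values, for illustration only
genome_size     <- 3e9    # total number of bases considered
annotated_bases <- 1.5e8  # bases associated with the ontology term
n_regions       <- 1000   # number of input genomic regions
k_hits          <- 80     # input regions that fall in annotated bases

# Probability that a single region hits the term by chance:
# the fraction of the genome associated with the term
p <- annotated_bases / genome_size

# One-sided binomial p-value: P(X >= k_hits) with X ~ Binomial(n_regions, p)
p_value <- pbinom(k_hits - 1, size = n_regions, prob = p, lower.tail = FALSE)

# The same test using binom.test()
bt <- binom.test(k_hits, n_regions, p = p, alternative = "greater")

p_value
bt$p.value
```

A term covering a large fraction of the genome gets a larger p, so it needs proportionally more hits before it is called enriched; this is how the coverage bias is taken into account.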
The Bioconductor annotation packages are an extensive collection of annotations. In this post I simply illustrate the basics of probing these annotation packages. For the first example I will use the org.Hs.eg.db package, which provides genome-wide annotations for the human genome.
#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("org.Hs.eg.db")

#load library
library(org.Hs.eg.db)

class(org.Hs.eg.db)
[1] "OrgDb"
attr(,"package")
[1] "AnnotationDbi"
We can query the package by using the select() function; to find out what we can select and return we can use the keys(), columns() and keytypes() functions: