Mapping repeats

Most eukaryotic genomes are interspersed with repetitive elements, and some of these elements are transcriptionally active, so they appear when we sequence the RNA population. From the trend of things, some of these elements seem to be important. One strategy for analysing these repeats is to map them to the genome, to see where they…


Using the Bioconductor annotation packages

Another post related to this course I’m going through (I can’t link it enough times). I have almost finished the first day of the course and couldn’t resist writing about this lecture on using the Bioconductor annotation packages. I had not realised that the annotation packages could be queried (pardon my ignorance) in the…


Using the Bioconductor GenomicRanges package

Updated: 2019 April 4th

From the introductory article: The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. To begin, install the package. The introduction article starts with creating a GRanges object: The GRanges class represents a collection of genomic features that each have a single start and end location…


Calculating intergenic regions

Intergenic regions are simply loci in the genome demarcated by where one gene ends and another starts. To calculate intergenic regions:

1. First create a BED file containing the coordinates of all genes
2. Sort this BED file by chromosome and then by the starting position
3. Merge this BED file using mergeBed
4. Run the script below (works…
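
The last step can be sketched in Python. This is only a minimal illustration, not the script from the post: it assumes the merged, sorted intervals are already in memory as `(chrom, start, end)` tuples in BED coordinates, and the function name `intergenic_regions` is hypothetical.

```python
from itertools import groupby


def intergenic_regions(merged_genes):
    """Yield (chrom, start, end) gaps between consecutive gene intervals.

    Assumes merged_genes is sorted by chromosome and then start, and that
    overlapping genes have already been merged (e.g. with mergeBed).
    Coordinates follow BED conventions: 0-based start, exclusive end.
    """
    for chrom, group in groupby(merged_genes, key=lambda gene: gene[0]):
        prev_end = None
        for _, start, end in group:
            # A gap exists only when the next gene starts after the previous one ends.
            if prev_end is not None and start > prev_end:
                yield (chrom, prev_end, start)
            prev_end = end


genes = [
    ("chr1", 100, 200),
    ("chr1", 300, 400),
    ("chr1", 400, 500),  # directly adjacent to the previous gene, so no gap
    ("chr2", 50, 80),
]
for region in intergenic_regions(genes):
    print(*region, sep="\t")
```

Note that this sketch does not report the stretches before the first gene or after the last gene on a chromosome; doing so would also need the chromosome sizes.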


GENCODE

By now you should have heard about the ENCODE project. GENCODE, summarised as Encyclopædia of genes and gene variants, is a sub-project of ENCODE where the aim is to annotate all evidence-based gene features in the entire human genome with high accuracy. This includes protein-coding genes and their isoforms, non-coding RNAs and pseudogenes…


Annotating RNA-Seq data

After mapping your reads from an RNA-Seq experiment, usually the next task is to identify the transcripts that the reads came from (i.e. annotating RNA-Seq data), and there are many ways of doing so. Here I just describe a rather crude method whereby I download sequence coordinates of hg19 RefSeqs as a BED12 file from the…


intersectBed

Updated: 2014 June 25th

The tool intersectBed is part of the BEDTools suite of tools and performs an intersection between two BED files. For example, given two BED files, you may be interested in finding the entries that overlap. To install the latest version of BEDTools, download the source code from GitHub and compile it.
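
Independent of the installation, the idea of an intersection can be sketched in a few lines of Python. This is not how BEDTools is implemented; it is a naive all-against-all comparison, and the function names `overlaps` and `intersect` are hypothetical. Intervals are assumed to be `(chrom, start, end)` tuples in BED coordinates (0-based start, exclusive end).

```python
def overlaps(a, b):
    """Return True if two BED-style intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect(bed_a, bed_b):
    """Return the shared portion of every overlapping pair of intervals.

    Naive O(len(bed_a) * len(bed_b)) comparison, fine for small inputs only.
    """
    shared = []
    for a in bed_a:
        for b in bed_b:
            if overlaps(a, b):
                shared.append((a[0], max(a[1], b[1]), min(a[2], b[2])))
    return shared


bed_a = [("chr1", 10, 20), ("chr1", 30, 40)]
bed_b = [("chr1", 15, 35), ("chr2", 10, 20)]
for region in intersect(bed_a, bed_b):
    print(*region, sep="\t")
```

Because BED ends are exclusive, two intervals that merely touch (one ends at 20, the other starts at 20) are not counted as overlapping here.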


Learning to use biomaRt

In the past I’ve been manually downloading tables of annotation data and parsing them with Perl. I guess it’s time to do things more elegantly. Below is code taken from the biomaRt vignette: Note: if you are using Ubuntu and getting a “Cannot find xml2-config” problem while installing XML, a prerequisite to biomaRt, try installing…


GC and AT content of 5′ UTRs, 3′ UTRs and coding exons of RefSeq gene models

First, download BED tracks of the 5′ UTR, 3′ UTR and coding exons from the UCSC Table Browser. The RefSeq gene models are in the table called refGene. After you’ve saved the three BED files (e.g. mm9_refgene_090212_5_utr.bed, mm9_refgene_090212_3_utr.bed and mm9_refgene_090212_coding_exon.bed), use the fastaFromBed program from the BEDTools suite and convert the bed file into a…
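
Once the sequences are available, the GC and AT content itself is a simple count. The sketch below is an illustration in Python rather than the code from the post; the function name `gc_at_content` is hypothetical, and it assumes that ambiguous bases such as N should be left out of the denominator.

```python
def gc_at_content(sequence):
    """Return (gc_fraction, at_fraction) for a DNA sequence.

    Case-insensitive, so soft-masked (lowercase) bases are counted too.
    Bases other than A, C, G and T (e.g. N) are ignored, which is an
    assumption of this sketch.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        return (0.0, 0.0)
    return (gc / total, at / total)


gc, at = gc_at_content("ATGCNNatgc")
print(f"GC: {gc:.2f}\tAT: {at:.2f}")
```

Running this over every record of each FASTA file and pooling the counts per feature type would give one GC and one AT figure for the 5′ UTRs, 3′ UTRs and coding exons.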


Gene deserts

Find regions of the mouse genome devoid of any annotation (ESTs, mRNA, repeats, RefSeq and UCSC genes). The annotation tracks were downloaded using the Table Browser feature of the UCSC Genome Browser. Chromosome sizes of mm9 were downloaded from here. Code for finding regions of 10 kb devoid of any annotation. In the mm9 genome I found 9,634 10kb…
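
One way to approach this, sketched in Python, is to walk along each chromosome and report every stretch between annotations that is long enough. This is not the code from the post: the function name `annotation_free_regions` is hypothetical, and it assumes a "region of 10 kb" means any annotation-free gap of at least 10,000 bp, which may differ from how the count above was obtained.

```python
def annotation_free_regions(annotations, chrom_sizes, min_length=10_000):
    """Return (chrom, start, end) gaps of at least min_length with no annotation.

    annotations: iterable of (chrom, start, end) in BED coordinates, pooled
                 from all tracks; may be unsorted and overlapping.
    chrom_sizes: dict mapping chromosome name to its length in bp.
    """
    by_chrom = {}
    for chrom, start, end in annotations:
        by_chrom.setdefault(chrom, []).append((start, end))

    deserts = []
    for chrom in sorted(chrom_sizes):
        cursor = 0  # end of the furthest annotation seen so far
        for start, end in sorted(by_chrom.get(chrom, [])):
            if start - cursor >= min_length:
                deserts.append((chrom, cursor, start))
            cursor = max(cursor, end)
        # Also check the stretch between the last annotation and the chromosome end.
        if chrom_sizes[chrom] - cursor >= min_length:
            deserts.append((chrom, cursor, chrom_sizes[chrom]))
    return deserts


tracks = [("chr1", 5000, 6000), ("chr1", 5500, 20000), ("chr1", 40000, 41000)]
sizes = {"chr1": 60000}
for region in annotation_free_regions(tracks, sizes):
    print(*region, sep="\t")
```

Tracking the furthest end seen so far means the tracks do not need to be merged beforehand; they only need to be pooled into one list.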
