Using the ENCODE ChIA-PET dataset

Updated: 2014 March 14th. From the Wikipedia article: Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) is a technique that incorporates chromatin immunoprecipitation (ChIP)-based enrichment, chromatin proximity ligation, Paired-End Tags, and high-throughput sequencing to determine de novo long-range chromatin interactions genome-wide. Let’s get started with the ENCODE ChIA-PET dataset by downloading the BED files,…

Mapping repeats

Most eukaryotic genomes are interspersed with repetitive elements, and some of these elements are transcriptionally active, so they appear when we sequence the RNA population. Recent findings suggest that some of these elements are functionally important. One strategy for analysing these repeats is to map them to the genome, to see where they…

Using tabix

Updated 2013 November 20th. Tabix, as described in the abstract of the paper: Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with…

Defining genomic regions

Updated 2014 June 24th to use GENCODE version 19. RNA sequencing (RNA-Seq) reads are typically mapped back to the genome (or, in some cases, the transcriptome) after sequencing. The next task is to annotate the reads, to see which regions they mapped to. Typically one creates an annotation file and compares the coordinates of the…

Calculating intergenic regions

Intergenic regions are simply loci in the genome demarcated by where one gene ends and the next one starts. To calculate intergenic regions:

1. Create a BED file containing the coordinates of all genes
2. Sort this BED file by chromosome and then by starting position
3. Merge this BED file using mergeBed
4. Run the script below (works…
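
The original script is not reproduced in this excerpt. As a rough illustration of the same steps, here is a minimal Python sketch that sorts and merges gene intervals in memory (standing in for the sort and mergeBed steps) and then reports the gaps between consecutive merged intervals. The function name `intergenic_regions` and the example coordinates are made up for illustration, and the sketch assumes BED-style 0-based, half-open coordinates. It does not report the regions before the first gene or after the last gene on a chromosome, since that would need chromosome sizes.

```python
from collections import defaultdict


def intergenic_regions(genes):
    """Return the gaps between merged gene intervals on each chromosome.

    genes: iterable of (chrom, start, end) tuples, assumed to use
    BED-style 0-based, half-open coordinates.
    """
    # Group intervals by chromosome
    by_chrom = defaultdict(list)
    for chrom, start, end in genes:
        by_chrom[chrom].append((start, end))

    gaps = []
    for chrom in sorted(by_chrom):
        # Sort by start position, then merge overlapping or touching genes
        intervals = sorted(by_chrom[chrom])
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])

        # Each gap runs from the end of one merged block to the start of the next
        for (_, prev_end), (next_start, _) in zip(merged, merged[1:]):
            gaps.append((chrom, prev_end, next_start))
    return gaps


# Hypothetical example: two overlapping genes and one separate gene on chr1
example = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 500, 600), ("chr2", 10, 20)]
print(intergenic_regions(example))  # [('chr1', 300, 500)]
```

Merging first matters: without it, overlapping genes would produce gaps with negative or misleading lengths.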

Genome mappability

I know of the genome mappability and uniqueness tracks provided by the UCSC Genome Browser, but I was interested in calculating this myself for the hg19 genome. As a test, I investigated chr22, where the base composition is broken down as follows: Length of chr22 = 51,304,566; A: 9,094,775; C: 8,375,984; G: 8,369,235; T: 9,054,551…
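
A base composition tally of this kind can be produced with a few lines of Python. The sketch below is only an illustration, not the method used in the post: the function name `base_composition` is hypothetical, and it assumes the input is an iterable of FASTA lines (for example, an open file handle). Lower-case (soft-masked) bases are folded into their upper-case counterparts, which is a design choice rather than something stated in the post.

```python
from collections import Counter


def base_composition(fasta_lines):
    """Count each base in FASTA-formatted lines, skipping header lines."""
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and FASTA headers such as ">chr22"
        if not line or line.startswith(">"):
            continue
        # Fold soft-masked (lower-case) bases into upper case before counting
        counts.update(line.upper())
    return counts


# Tiny made-up example; a real run would pass an open FASTA file instead
toy = [">chrTest", "ACGTNNac", "GGTT"]
composition = base_composition(toy)
print(sum(composition.values()), dict(composition))
```

Counting `N` separately is useful here because unsequenced gaps contribute to the chromosome length but not to the A, C, G and T totals.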

GENCODE

By now you should have heard about the ENCODE project. GENCODE, summarised as Encyclopædia of genes and gene variants, is a sub-project of ENCODE that aims to annotate all evidence-based gene features in the entire human genome with high accuracy. This includes protein-coding genes and their isoforms, non-coding RNAs and pseudogenes….

The 1000 Genomes Project

The 1000 Genomes Project started as an endeavour to capture as much human genetic variation as possible. The results of the pilot phase are published in Nature. To sequence a person’s genome, many copies of the DNA are broken into short pieces and each piece is sequenced, mapped, and stored in alignment files….

intersectBed

Updated 2014 June 25th. The tool intersectBed is part of the BEDTools suite and performs an intersection between two BED files. For example, given two BED files, you may be interested in finding the entries that overlap. To install the latest version of BEDTools, download the source code from GitHub and compile:

Sequence conservation in vertebrates

The UCSC Genome Browser provides multiple alignments of 46 vertebrate species and makes them available for download. The multiple alignments show regions of sequence conservation among vertebrates. For more information see http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=cons46way. The multiple alignments are stored in Multiple Alignment Format (MAF) files, and there are Perl and Python packages that parse them. The MAF format is…
