Making slides using R

Updated 2015 January 14th

I'll be giving a talk on analysing CAGE data using R in an upcoming workshop held at our institute early next year. I thought it would be a good time to check out slidify, which is an R package that can create HTML slides from R Markdown, since I'm planning on making slides using R. I will be using RStudio to convert R Markdown (i.e. knit) into a HTML document.

Continue reading

CAGE analysis using the R Bioconductor package CAGEr

This post is outdated; please refer to the official documentation.

Cap Analysis Gene Expression (CAGE) is a molecular technique, developed at RIKEN, which captures all transcription starting sites (TSSs) of an RNA population. The C in CAGE refers to the altered nucleotide at the 5' site of precursor messenger RNA, termed the cap, which CAGE targets and pulls down. The vignette of the CAGEr package has a very nice introduction to CAGE. I'd just like to add that several other CAGE protocol exists, such as HeliScopeCAGE and nanoCAGE. While these protocols all capture TSSs, the biochemical steps are different, especially nanoCAGE, which does not use CAP trapping but template switching. If you're interested in template switching with respect to transcriptome studies, have a look at the introduction of this paper, which I wrote.

In this post I will go through the workflow of the CAGEr package. If you perform CAGE analysis, I highly recommend using this package. It provides the methods/analysis steps that are commonly used in CAGE analyses and eliminates the use of in house scripts/methods. For the first part I will use publicly available FANTOM3 and FANTOM4 data that is conveniently packaged in Bioconductor as part of CAGEr. The second part shows an example session using ENCODE CAGE data, which is conveniently provided as BAM files.

Continue reading

Encyclopedia of DNA elements (ENCODE)

ENCODE is an abbreviation of "Encyclopedia of DNA elements", a project that aims to discover and define the functional elements encoded in the human genome via multiple technologies. Used in this context, the term "functional elements" is used to denote a discrete region of the genome that encodes a defined product (e.g., protein) or a reproducible biochemical signature, such as transcription or a specific chromatin structure. It is now widely appreciated that such signatures, either alone or in combination, mark genomic sequences with important functions, including exons, sites of RNA processing, and transcriptional regulatory elements such as promoters, enhancers, silencers, and insulators.

The data included from the ENCODE project include:

  1. Identification and quantification of RNA species in whole cells and in sub-cellular compartments
  2. Mapping of protein-coding regions
  3. Delineation of chromatin and DNA accessibility and structure with nucleases and chemical probes
  4. Mapping of histone modifications and transcription factor binding sites (TFBSs) by chromatin immunoprecipitation (ChIP)
  5. and measurement of DNA methylation

To facilitate comparison and integration of data, ENCODE data production efforts have prioritised selected sets of cell types. The highest priority set (designated "Tier 1) includes two widely studied immortalised cell lines - K562 erythroleukemia cells; an EBV-immortalised B-lymphoblastoid line and the H1 human embryonic stem cell line. A secondary priority set (Tier 2) includes HeLa-S3 cervical carcinoma cells, HepG2 hepatoblastoma cells, and primary (non-transformed) human umbilical vein endothelial cells (HUVEC), which have limited proliferation potential in culture. A third set (Tier 3) currently comprises more than 100 cell types that are being analysed in selected assays.

A major goal of ENCODE is to manually annotate all protein-coding genes, pseudogenes, and non-coding transcribed loci in the human genome and to catalog the products of transcription including splice isoforms. This annotation process involves consolidation of all evidence of transcripts (cDNA, EST sequences) and proteins from public databases, followed by building gene structures based on supporting experimental data.

Ultimately, each gene or transcript model is assigned one of three confidence levels:

  1. Level 1 includes genes validated by RT-PCR and sequencing, plus consensus psuedogenes.
  2. Level 2 includes manually annotated coding and long non-coding loci that have transcriptional evidence in EMBL/GenBank.
  3. Level 3 includes Ensembl gene predictions in regions not yet manually annotated or for which there is new transcriptional evidence.

The result of ENCODE gene annotation (termed "GENCODE") is a comprehensive catalog of transcripts and gene models.

Another goal of ENCODE is to produce a comprehensive genome-wide catalog of transcribed loci that characterises the size, polyadenylation status, and subcellular compartmentalisation of all transcripts. Both polyA+ and polyA- RNAs are being analysed and not only total whole cell RNAs but also those concentrated in the nucleus and cytosol. Long (>200nt) and short RNAs (<200nt) are being sequenced from each subcellular compartment, providing catalogs of potential miRNAs, snoRNA, promoter-associated short RNAs (PASRs) and other short cellular RNAs. Cis-regulatory regions include diverse functional elements (e.g., promoters, enhancers, silencers, and insulators) that collectively modulate the magnitude, timing, and cell-specificity of gene expression. The ENCODE Project is using multiple approaches to identify cis-regulatory regions, including localising their characteristic chromatin signatures and identifying sites of occupancy of sequence-specific transcription factors. Human cis-regulatory regions characteristically exhibit nuclease hypersensitivity and may show increased solubility after chromatin fixation and fragmentation. Additionally, specific patterns of post-translational histone modifications have been connected with distinct classes of regions such as promoters and enhancers. Chromatin accessibility and histone modifications thus provide independent and complementary annotations of human regulatory DNA. DNaseI hypersensitive sites (DHSs) are being mapped by two techniques: (i) capture of free DNA ends at in vivo DNaseI cleavage sites with biotinylated adapters, followed by digestion with a TypeIIS restiction enzyme to generate ~20 bp DNaseI cleavage site tags and (ii) direct sequencing of DNaseI cleavage sites at the ends of small (<300 bp) DNA fragments released by limiting treatment with DNaseI. For more information see the ENCODE user guide.

Downloading ENCODE data

Release ENCODE data can be download at

For example if you are interested in the H3K4Me1 CHiP-Seq data, bigWig files are provided at You may also be interested in the H3K4Me3 data, since this provides some perspective on the H3K4Me1 data.

To convert the bigWig file to a human readable (i.e. non-binary), download the bigWigToBedGraph executable.

Then convert the bigWig file to bedGraph by running:

bigWigToBedGraph wgEncodeBroadHistoneK562H3k4me1StdSig.bigWig wgEncodeBroadHistoneK562H3k4me1StdSig.bedgraph

The output:

chr1 10100 10125 0.6
chr1 10125 10225 1
chr1 10225 10250 1.36
chr1 10250 10275 2
chr1 10275 10300 3.64

Which can then be transformed into a BED file, for use with intersectBed.

Clustering mapped reads

Updated 2014 October 8th to include an analysis using CAGE data from ENCODE and rewrote parts of the post.

In this post I will write about a read clustering method called paraclu, which allows mapped reads to be clustered together. This is particularly useful when working with CAGE data, where transcription start sites (TSSs) are known to initiate at different positions but are all providing information on the promoter activity of a transcript, so it is useful to cluster the TSSs together. In addition, paraclu allows different levels of clustering, so you can choose the level that you want. Furthermore, studying the clusters at different levels can reveal subtle properties of promoters; this is akin to adjusting the bin size of histograms to see if certain properties arise.

Continue reading

Visualising FANTOM3 human CAGE data

First download the relevant tables from here. Download tss_library_expr_summary.tsv.bz2, tss_summary.tsv.bz2 and rna_lib_summary.tsv.bz2 and bunzip2 the files.

According to information derived from rna_lib_summary.tsv there are 13,897 CAGE tags from the prostate gland (rna_lib_id = HBA).

Double check this:

cat tss_library_expr_summary.tsv | awk '$3=="HBA" {print}' | cut -f5 | perl -e 'while(<>){ chomp; $i+=$_; } print "$i\n";'

cat tss_library_expr_summary.tsv | awk '$3=="HBA" {print}' | cut -f2 | sort -u | wc -l

Genomic coordinates of TSSs can be obtained from tss_summary.tsv. Here's a script which generates bedgraph files of all the CAGE tags for each tissue. Bugs may be present in the code; use at your own risk. Note I commented out strictures because I wanted to use scalars as filehandles (which is ambiguous code).


#use strict;
use warnings;

my $mapping = 'tss_summary.tsv';
my $tss = 'tss_library_expr_summary.tsv';
my $library = 'rna_lib_summary.tsv';

my %library = ();
open(IN,'<',$library) || die "Could not open $library: $!\n";
   my ($lib,$junk,$junk2,$junk3,$desc,@junk) = split(/\t/);
   next unless $lib =~ /^H/;
   $library{$lib} = $desc;

my %mapping = ();

open(IN,'<',$mapping) || die "Could not open $mapping: $!\n";
   my ($tss_id,$junk,$junk2,$chr,$junk3,$strand,$start,$end,@junk) = split(/\t/);
   if ($strand eq 'F'){
      $strand = '+';
   } else {
      $strand = '-';
   $mapping{$tss_id}->{'CHR'} = $chr;
   $mapping{$tss_id}->{'STRAND'} = $strand;
   $mapping{$tss_id}->{'START'} = $start;
   $mapping{$tss_id}->{'END'} = $end;

foreach my $lib (keys %library){

   my $library = $lib;
   open($lib,'>',"${lib}.bed") || die "Could not open $lib.bed for writing: $!\n";

print $lib <track type=bedGraph name="$library{$library}" description="FANTOM3 CAGE $library{$library}" visibility=full color=0,0,255 altColor=255,0,0 priority=20


open(IN,'<',$tss) || die "Could not open $tss: $!\n";
   my ($tss_rna_lib_id,$tss_id,$rna_lib_id,$priming,$tss_tags,$tpm) = split(/\t/);
   next unless $rna_lib_id =~ /^H/;
   if (exists $mapping{$tss_id}){
      my $chr = $mapping{$tss_id}->{'CHR'};
      my $strand = $mapping{$tss_id}->{'STRAND'};
      my $start = $mapping{$tss_id}->{'START'};
      my $end = $mapping{$tss_id}->{'END'};
      if ($strand eq '-'){
         $tss_tags = '-' . $tss_tags;
      print $rna_lib_id join("\t",$chr,$start,$end,$tss_tags),"\n";
   } else {
      warn "Missing information for $tss_id\n";


This will generate 42 bedgraph files.

Double check:

cat HBA.bed | cut -f4 | sed 's/-//' | perl -e 'while(<>){ chomp; $i+=$_; } print "$i\n";'

Loaded two of the bedgraphs into the UCSC Genome Browser.