Updated: 2013 November 15th
A while ago I asked on Twitter, what are some tools that people use to visualise hundreds of bam files. One of the suggestions was Gviz (thanks Sebastian!) and I had a quick glimpse at the Bioconductor package and the plots looked really great! Here I use Gviz to plot features along a reference sequence and for visualising bam files.
From the vignette:
The Gviz package aims to provide a structured visualisation framework for plotting any type of data along genomic coordinates. The fundamental concept behind the Gviz package is similar to the approach taken by most genome browsers, in that individual types of genomic features or data are represented by separate tracks.
Let's get started with following the example as per the vignette:
#install the package source("http://bioconductor.org/biocLite.R") biocLite("Gviz") #load the package library(Gviz) #following the example library(GenomicRanges) data(cpgIslands) class(cpgIslands) [1] "GRanges" attr(,"package") [1] "GenomicRanges" chr <- as.character(unique(seqnames(cpgIslands))) gen <- genome(cpgIslands) atrack <- AnnotationTrack(cpgIslands, name = "CpG") gtrack <- GenomeAxisTrack() itrack <- IdeogramTrack(genome = gen, chromosome = chr) data(geneModels) grtrack <- GeneRegionTrack(geneModels, genome = gen, chromosome = chr, name = "Gene Model") plotTracks(list(itrack, gtrack, atrack, grtrack))
Plotting features along a reference
I was interested in plotting features mapped to my own reference sequence and wondered if I could use Gviz to achieve this. So firstly I had to create a GRanges object for my reference and then I used the GenomeAxisTrack() function to create my own reference track. Next I created another GRanges object for my features and used the AnnotationTrack() function to create my features track. Then I simply used the plotTracks() function to plot the 2 tracks.
Of note is that I used 'chr1' as a dummy name for my reference. Gviz complains when I use another name that's not compatible with their naming convention (or perhaps I'm not using it properly). It works fine since 'chr1' is not displayed.
ref <- GRanges('chr', IRanges(1, 500)) ref_track <- GenomeAxisTrack(ref, lwd=4, fontsize=20) data <- data.frame(chr=c('chr1','chr1','chr1'), start=c(50,200,400), end=c(75, 250, 500), id=c('one', 'two', 'three'), strand=c('+','-','-')) data chr start end id strand 1 chr1 50 75 one + 2 chr1 200 250 two - 3 chr1 400 500 three - data_g <- with(data, GRanges(chr, IRanges(start, end), strand, id=id)) data_g GRanges with 3 ranges and 1 metadata column: seqnames ranges strand | id <Rle> <IRanges> <Rle> | <factor> [1] chr1 [ 50, 75] + | one [2] chr1 [200, 250] - | two [3] chr1 [400, 500] - | three --- seqlengths: chr1 NA data_track <- AnnotationTrack(data_g, name = "Features", width = 15, showFeatureId = T, min.height=2) plotTracks(c(ref_track, data_track))
For more parameters and settings, such as adjusting the font, colours, etc., look in the manual.
Using Gviz with bam files
For this first example, download two bam files from ENCODE that correspond to two CAGE experiments prepared on the nuclear fraction of K562 cells using either a poly-A plus and minus protocol:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageK562NucleusPamAln.bam wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageK562NucleusPapAlnRep1.bam
These two files are 476M and 619M in size, however a nice feature of Gviz as mentioned in the vignette:
only the relevant part of the data has to be loaded during the plotting operation, so the underlying data files may be quite large without decreasing the performance or causing too big of a memory footprint.
This is particularly important for me as I do plan on plotting hundreds of bam files. Let's load the files into R:
library(Gviz) k562_nuc_pam <- 'wgEncodeRikenCageK562NucleusPamAln.bam' k562_nuc_pap <- 'wgEncodeRikenCageK562NucleusPapAlnRep1.bam' #index the bam files, if you haven't already library(Rsamtools) indexBam(k562_nuc_pam) indexBam(k562_nuc_pap) k562_nuc_pam_track <- DataTrack(range = k562_nuc_pam, genome = "hg19", type = "heatmap", name = "Coverage", window = -1, chromosome = "chr1") k562_nuc_pap_track <- DataTrack(range = k562_nuc_pap, genome = "hg19", type = "heatmap", name = "Coverage", window = -1, chromosome = "chr1") plotTracks(list(k562_nuc_pam_track, k562_nuc_pap_track), from = 100000, to = 1000000)
The scales of the heatmaps are different; the peak in the middle of the bottom track makes the expression of other peaks less significant.
Let's add some more features to the plot:
library(Gviz) k562_nuc_pam <- 'wgEncodeRikenCageK562NucleusPamAln.bam' k562_nuc_pap <- 'wgEncodeRikenCageK562NucleusPapAlnRep1.bam' chr <- 'chr7' gen <- 'hg19' gtrack <- GenomeAxisTrack() itrack <- IdeogramTrack(genome = gen, chromosome = chr) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene class(txdb) [1] "TranscriptDb" attr(,"package") [1] "GenomicFeatures" txdb TranscriptDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: UCSC | Genome: hg19 | Organism: Homo sapiens | UCSC Table: knownGene | Resource URL: http://genome.ucsc.edu/ | Type of Gene ID: Entrez Gene ID | Full dataset: yes | miRBase build ID: GRCh37.p5 | transcript_nrow: 80922 | exon_nrow: 286852 | cds_nrow: 235842 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2013-03-08 09:43:09 -0800 (Fri, 08 Mar 2013) | GenomicFeatures version at creation time: 1.11.14 | RSQLite version at creation time: 0.11.2 | DBSCHEMAVERSION: 1.0 grtrack <- GeneRegionTrack(txdb, genome = gen, chromosome = chr, name = "UCSC known genes") k562_nuc_pam_track <- DataTrack(range = k562_nuc_pam, genome = gen, type = "heatmap", name = "Coverage", window = -1, chromosome = chr) k562_nuc_pap_track <- DataTrack(range = k562_nuc_pap, genome = gen, type = "heatmap", name = "Coverage", window = -1, chromosome = chr) plotTracks(list(itrack, gtrack, grtrack, k562_nuc_pam_track, k562_nuc_pap_track), from = 5566000, to = 5570500)
40 bam files
Before we go nuts, let's try 40 bam files first.
setwd("to_directory_of_bam_files") bam_file <- list.files('.', "*.bam$") length(bam_file) #index the bam files, if you haven't already library(Rsamtools) for (file in bam_file){ indexBam(file) } library(Gviz) chr <- 'chr7' gen <- 'hg19' gtrack <- GenomeAxisTrack() itrack <- IdeogramTrack(genome = gen, chromosome = chr) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene grtrack <- GeneRegionTrack(txdb, genome = gen, chromosome = chr, name = "UCSC known genes") #create 40 tracks track_name <- gsub(pattern=".bam", replacement="_track", x=bam_file) counter = 0 for (file in bam_file){ counter = counter + 1 assign(x=track_name[counter], value=DataTrack(range = bam_file[counter], genome = gen, type = "heatmap", name = "Coverage", gradient=c('white','black'), size=0.3, window = -1, chromosome = chr) ) } #build list for plotTracks my_list <- list(itrack, gtrack, grtrack) for (i in 1:40){ my_list[[i+3]] <- get(track_name[i])} #coordinates for beta actin plotTracks(my_list, from = 5566000, to = 5570500)
This is CAGE data, so signal is at the 5' end of the transcript.
OK let's go nuts with 988 bam files:
library(Gviz) bam_file <- list.files('.', "*.bam$") length(bam_file) [1] 988 chr <- 'chr7' gen <- 'hg19' gtrack <- GenomeAxisTrack() itrack <- IdeogramTrack(genome = gen, chromosome = chr) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene grtrack <- GeneRegionTrack(txdb, genome = gen, chromosome = chr, name = "UCSC known genes") track_name <- gsub(pattern=".bam", replacement="_track", x=bam_file) counter = 0 for (file in bam_file){ counter = counter + 1 assign(x=track_name[counter], value=DataTrack(range = bam_file[counter], genome = gen, type = "heatmap", name = "Coverage", size=0.01, gradient=c('white','black'), window = -1, chromosome = chr) ) } #build list for plotTracks my_list <- list(itrack, gtrack, grtrack) for (i in 1:length(bam_file)){ my_list[[i+3]] <- get(track_name[i])} #coordinates for beta actin plotTracks(my_list, from = 5566000, to = 5570500)
Conclusion
Gviz is awesome! It also supports 2bit, fasta, bedGraph, wig, bigWig, GFF and bed files, so we can go more nuts by adding even more tracks.
See also
I used Gviz to create my thesis cover art.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
I used this package some time ago to quantitatively visualize sRNA-seq data in the A. thaliana genome. It was a shame that there was no existing ideogram for that genome, but now I’m working with human data so I’m excited to use it again. Your examples with bam files will be very useful for my purposes.
Nice Post!
Hi Pablo,
Thanks! I get spoilt working with either human or mouse data 😉
I’m glad you found the post useful!
Cheers,
Dave
Hej, nice Tutorial!
I’m wondering if its possible to use the same scale for the heatmaps. I have 20bam files, differing in read depth. (e.g sample1: max= 800, sample2: max= 1200 readth/base)
With your code each heatmap is fitted for the individual maxima. But I want to define a global Maxima and scaled heatmaps based on that. Do you have a quick idea to solve this problem?
Greets
Thanks! Yes that was something I was supposed to figure out but I haven’t yet. I’ll reply again to your comment, when I’ve worked it out. Or if you figure it out before me, please leave me a comment 🙂
OK, I found the example from Gviz and this is exactly what I wanted:
data(twoGroups)
dTrack <- DataTrack(twoGroups, name = "uniform")
plotTracks(dTrack, type = c("heatmap"), showSampleNames = TRUE,cex.sampleNames = 0.6)
"twoGroups" is GRanges object with numeric data from different samples. In my case I need mean read counts for each segment.
For my dataset I made specific target regions (width= 100).
Segments<- GRanges(chr,start,end)
The I used getSegmentReadCountsFromBAM() from the cn.mops package. to calculate the read counts for each segment.
X<-getSegmentReadCountsFromBAM(BAM.files,GR=Segments,mode="paired")
And finally the data track
datatrack<-DataTrack(X,
genome = "hg19",
type = "heatmap",
chromosome = chr,name="Heatmap")
The display parameters for DataTrack objects does include a ylim argument. So you can set the range of the y-axis when you construct the data track:
k562_nuc_pap_track <- DataTrack(
range = k562_nuc_pap,
genome = "hg19",
type = "heatmap",
name = "Coverage",
window = -1,
chromosome = "chr1",
ylim = c(0,1200)
)
Hi, A very nice blog things become so much easier.
I have a very basic question. How to make object similar to two groups.
I am making the following command but I guess I am doing something wrong because I get an error.
with(data, GRanges(chr, IRanges(start, end), strand, id=id, control=c(B1,B2,B3),treatment=c(C1,C2,C3)))
B 1,2,3 and C1,2,3 are numeric values
Thank you
How much time will gviz take to generate some plot with 1000 bam files?
I can’t remember. It would depend on your hardware, data, and how much data you want to plot.