Using Gviz

Updated: 2013 November 15th

A while ago I asked on Twitter, what are some tools that people use to visualise hundreds of bam files. One of the suggestions was Gviz (thanks Sebastian!) and I had a quick glimpse at the Bioconductor package and the plots looked really great! Here I use Gviz to plot features along a reference sequence and for visualising bam files.

From the vignette:

The Gviz package aims to provide a structured visualisation framework for plotting any type of data along genomic coordinates. The fundamental concept behind the Gviz package is similar to the approach taken by most genome browsers, in that individual types of genomic features or data are represented by separate tracks.

Let's get started with following the example as per the vignette:

#install the package
source("http://bioconductor.org/biocLite.R")
biocLite("Gviz")

#load the package
library(Gviz)

#following the example
library(GenomicRanges)
data(cpgIslands)
class(cpgIslands)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"

chr <- as.character(unique(seqnames(cpgIslands)))
gen <- genome(cpgIslands)
atrack <- AnnotationTrack(cpgIslands, name = "CpG")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
data(geneModels)
grtrack <- GeneRegionTrack(geneModels, genome = gen, chromosome = chr, name = "Gene Model")
plotTracks(list(itrack, gtrack, atrack, grtrack))

vignette_plotLooks really nice!

Plotting features along a reference

I was interested in plotting features mapped to my own reference sequence and wondered if I could use Gviz to achieve this. So firstly I had to create a GRanges object for my reference and then I used the GenomeAxisTrack() function to create my own reference track. Next I created another GRanges object for my features and used the AnnotationTrack() function to create my features track. Then I simply used the plotTracks() function to plot the 2 tracks.

Of note is that I used 'chr1' as a dummy name for my reference. Gviz complains when I use another name that's not compatible with their naming convention (or perhaps I'm not using it properly). It works fine since 'chr1' is not displayed.

ref <- GRanges('chr', IRanges(1, 500))
ref_track <- GenomeAxisTrack(ref, lwd=4, fontsize=20)

data <- data.frame(chr=c('chr1','chr1','chr1'),
                   start=c(50,200,400),
                   end=c(75, 250, 500),
                   id=c('one', 'two', 'three'),
                   strand=c('+','-','-'))
data
   chr start end    id strand
1 chr1    50  75   one      +
2 chr1   200 250   two      -
3 chr1   400 500 three      -
data_g <- with(data, GRanges(chr, IRanges(start, end), strand, id=id))
data_g
GRanges with 3 ranges and 1 metadata column:
      seqnames     ranges strand |       id
         <Rle>  <IRanges>  <Rle> | <factor>
  [1]     chr1 [ 50,  75]      + |      one
  [2]     chr1 [200, 250]      - |      two
  [3]     chr1 [400, 500]      - |    three
  ---
  seqlengths:
   chr1
     NA

data_track <- AnnotationTrack(data_g, name = "Features", width = 15, showFeatureId = T, min.height=2)
plotTracks(c(ref_track, data_track))

gviz_exampleFor more parameters and settings, such as adjusting the font, colours, etc., look in the manual.

Using Gviz with bam files

For this first example, download two bam files from ENCODE that correspond to two CAGE experiments prepared on the nuclear fraction of K562 cells using either a poly-A plus and minus protocol:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageK562NucleusPamAln.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageK562NucleusPapAlnRep1.bam

These two files are 476M and 619M in size, however a nice feature of Gviz as mentioned in the vignette:

only the relevant part of the data has to be loaded during the plotting operation, so the underlying data files may be quite large without decreasing the performance or causing too big of a memory footprint.

This is particularly important for me as I do plan on plotting hundreds of bam files. Let's load the files into R:


library(Gviz)

k562_nuc_pam <- 'wgEncodeRikenCageK562NucleusPamAln.bam'
k562_nuc_pap <- 'wgEncodeRikenCageK562NucleusPapAlnRep1.bam'

#index the bam files, if you haven't already
library(Rsamtools)
indexBam(k562_nuc_pam)
indexBam(k562_nuc_pap)

k562_nuc_pam_track <- DataTrack(range = k562_nuc_pam,
                                genome = "hg19",
                                type = "heatmap",
                                name = "Coverage",
                                window = -1,
                                chromosome = "chr1")

k562_nuc_pap_track <- DataTrack(range = k562_nuc_pap,
                                genome = "hg19",
                                type = "heatmap",
                                name = "Coverage",
                                window = -1,
                                chromosome = "chr1")

plotTracks(list(k562_nuc_pam_track, k562_nuc_pap_track), from = 100000, to = 1000000)

bam_coverage_heatmapThe scales of the heatmaps are different; the peak in the middle of the bottom track makes the expression of other peaks less significant.

Let's add some more features to the plot:

library(Gviz)

k562_nuc_pam <- 'wgEncodeRikenCageK562NucleusPamAln.bam'
k562_nuc_pap <- 'wgEncodeRikenCageK562NucleusPapAlnRep1.bam'

chr <- 'chr7'
gen <- 'hg19'

gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

class(txdb)
[1] "TranscriptDb"
attr(,"package")
[1] "GenomicFeatures"

txdb
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: UCSC
| Genome: hg19
| Organism: Homo sapiens
| UCSC Table: knownGene
| Resource URL: http://genome.ucsc.edu/
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| miRBase build ID: GRCh37.p5
| transcript_nrow: 80922
| exon_nrow: 286852
| cds_nrow: 235842
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2013-03-08 09:43:09 -0800 (Fri, 08 Mar 2013)
| GenomicFeatures version at creation time: 1.11.14
| RSQLite version at creation time: 0.11.2
| DBSCHEMAVERSION: 1.0

grtrack <- GeneRegionTrack(txdb, genome = gen,
                           chromosome = chr, name = "UCSC known genes")

k562_nuc_pam_track <- DataTrack(range = k562_nuc_pam,
                                genome = gen,
                                type = "heatmap",
                                name = "Coverage",
                                window = -1,
                                chromosome = chr)

k562_nuc_pap_track <- DataTrack(range = k562_nuc_pap,
                                genome = gen,
                                type = "heatmap",
                                name = "Coverage",
                                window = -1,
                                chromosome = chr)

plotTracks(list(itrack, gtrack, grtrack, k562_nuc_pam_track, k562_nuc_pap_track),
           from = 5566000, to = 5570500)

bam_coverage_heatmap_featuresThat looks better.

40 bam files

Before we go nuts, let's try 40 bam files first.

setwd("to_directory_of_bam_files")
bam_file <- list.files('.', "*.bam$")
length(bam_file)

#index the bam files, if you haven't already
library(Rsamtools)
for (file in bam_file){
  indexBam(file)
}

library(Gviz)
chr <- 'chr7'
gen <- 'hg19'

gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
grtrack <- GeneRegionTrack(txdb, genome = gen,
                           chromosome = chr, name = "UCSC known genes")

#create 40 tracks
track_name <- gsub(pattern=".bam", replacement="_track", x=bam_file)

counter = 0
for (file in bam_file){
  counter = counter + 1
  assign(x=track_name[counter], value=DataTrack(range = bam_file[counter],
                                                genome = gen,
                                                type = "heatmap",
                                                name = "Coverage",
                                                gradient=c('white','black'),
                                                size=0.3,
                                                window = -1,
                                                chromosome = chr)
         )
}

#build list for plotTracks
my_list <- list(itrack, gtrack, grtrack)
for (i in 1:40){ my_list[[i+3]] <- get(track_name[i])}

#coordinates for beta actin
plotTracks(my_list, from = 5566000, to = 5570500)

gviz_actb_40This is CAGE data, so signal is at the 5' end of the transcript.

OK let's go nuts with 988 bam files:

library(Gviz)
bam_file <- list.files('.', "*.bam$")
length(bam_file)
[1] 988

chr <- 'chr7'
gen <- 'hg19'
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
grtrack <- GeneRegionTrack(txdb, genome = gen,
                           chromosome = chr, name = "UCSC known genes")

track_name <- gsub(pattern=".bam", replacement="_track", x=bam_file)
 
counter = 0
for (file in bam_file){
  counter = counter + 1
  assign(x=track_name[counter], value=DataTrack(range = bam_file[counter],
                                                genome = gen,
                                                type = "heatmap",
                                                name = "Coverage",
                                                size=0.01,
                                                gradient=c('white','black'),
                                                window = -1,
                                                chromosome = chr)
         )
}

#build list for plotTracks
my_list <- list(itrack, gtrack, grtrack)
for (i in 1:length(bam_file)){ my_list[[i+3]] <- get(track_name[i])}
 
#coordinates for beta actin
plotTracks(my_list, from = 5566000, to = 5570500)

gviz_991_tracksAlmost 1,000 tracks!.

Conclusion

Gviz is awesome! It also supports 2bit, fasta, bedGraph, wig, bigWig, GFF and bed files, so we can go more nuts by adding even more tracks.

See also

I used Gviz to create my thesis cover art.

Visualizing genomic features with the Gviz package.

Link to the vignette




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
11 comments Add yours
  1. I used this package some time ago to quantitatively visualize sRNA-seq data in the A. thaliana genome. It was a shame that there was no existing ideogram for that genome, but now I’m working with human data so I’m excited to use it again. Your examples with bam files will be very useful for my purposes.

    Nice Post!

    1. Hi Pablo,

      Thanks! I get spoilt working with either human or mouse data 😉

      I’m glad you found the post useful!

      Cheers,

      Dave

  2. Hej, nice Tutorial!
    I’m wondering if its possible to use the same scale for the heatmaps. I have 20bam files, differing in read depth. (e.g sample1: max= 800, sample2: max= 1200 readth/base)
    With your code each heatmap is fitted for the individual maxima. But I want to define a global Maxima and scaled heatmaps based on that. Do you have a quick idea to solve this problem?
    Greets

    1. Thanks! Yes that was something I was supposed to figure out but I haven’t yet. I’ll reply again to your comment, when I’ve worked it out. Or if you figure it out before me, please leave me a comment 🙂

      1. OK, I found the example from Gviz and this is exactly what I wanted:
        data(twoGroups)
        dTrack <- DataTrack(twoGroups, name = "uniform")
        plotTracks(dTrack, type = c("heatmap"), showSampleNames = TRUE,cex.sampleNames = 0.6)

        "twoGroups" is GRanges object with numeric data from different samples. In my case I need mean read counts for each segment.

        For my dataset I made specific target regions (width= 100).
        Segments<- GRanges(chr,start,end)

        The I used getSegmentReadCountsFromBAM() from the cn.mops package. to calculate the read counts for each segment.

        X<-getSegmentReadCountsFromBAM(BAM.files,GR=Segments,mode="paired")

        And finally the data track

        datatrack<-DataTrack(X,
        genome = "hg19",
        type = "heatmap",
        chromosome = chr,name="Heatmap")

        1. The display parameters for DataTrack objects does include a ylim argument. So you can set the range of the y-axis when you construct the data track:

          k562_nuc_pap_track <- DataTrack(
          range = k562_nuc_pap,
          genome = "hg19",
          type = "heatmap",
          name = "Coverage",
          window = -1,
          chromosome = "chr1",
          ylim = c(0,1200)
          )

  3. Hi, A very nice blog things become so much easier.
    I have a very basic question. How to make object similar to two groups.
    I am making the following command but I guess I am doing something wrong because I get an error.
    with(data, GRanges(chr, IRanges(start, end), strand, id=id, control=c(B1,B2,B3),treatment=c(C1,C2,C3)))

    B 1,2,3 and C1,2,3 are numeric values
    Thank you

    1. I can’t remember. It would depend on your hardware, data, and how much data you want to plot.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.